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Abstract 

One  of  the  major  disadvantages  of  paral- 
lel programming  with  shared  memory  is  the 
nondeterministic  behavior  caused  by  uncoordi- 
nated access  to  shared  variables,  known  as  ac- 
cess anomalies.  Monitoring  program  execution 
to  detect  access  anomalies  is  a  promising  and 
relatively  unexplored  approach  to  this  prob- 
lem. We  present  a  new  algorithm,  referred  to 
as  task  recycling,  for  detecting  anomalies,  and 
compare  it  to  an  existing  algorithm.  Empirical 
results  indicate  several  significant  conclusions: 
(i)  While  space  requirements  are  bounded  by 
0(T  x  V),  where  T  is  the  maximum  number  of 
threads  that  may  potentially  execute  in  paral- 
lel and  V  is  the  number  of  variable  monitored, 
for  typical  programs  space  requirements  are  on 
average  O(V).  (ii)  Task  recycling  is  more  effi- 
cient in  terms  of  space  requirements  and  often 
in  performance,  (iii)  The  general  approach  of 
monitoring  to  detect  access  anomalies  is  prac- 
tical. 

1      Introduction 

Erroneous  non-deterministic  behavior  in  shared  mem- 
ory   parallel    programs    is   often    the    result    of   access 
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anomalies.  An  access  anomaly  occurs  when  two  concur- 
rent execution  threads  access  the  same  memory  location 
in  an  "unsafe"  manner:  more  specifically,  when  either 
two  concurrent  threads  both  write  or  one  reads  and  one 
writes  a  shared  memory  location  without  coordinating 
these  accesses.  The  program  segment  in  Figure  1  illus- 
trates an  access  anomaly.  The  doall  construct  creates 
two  parallel  threads  that  both  write  the  variable  .V.  The 
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doall  i  =  1  to  2 
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endall 
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Figure  1:  Simple  Program  With  an  Access  Anomaly 

value  of  A"  subsequently  used  to  calculate  Z  depends  on 
whether  iterate  i  =  1  or  i  =  2  updates  X  last.  There- 
fore the  two  writes  to  X  are  anomalous.  Variable  Y 
is  also  accessed  in  both  concurrent  threads  i  =  1  and 
i  =  2;  however,  this  is  not  an  anomaly  because  both  ac- 
cesses are  reads.  Similarly,  the  assignment  to  Y  in  the 
first  statement  does  not  cause  an  anomaly  because  it  is 
always  performed  before  either  read.  While  certain  in- 
ternal non-determinism  [5]  does  not  affect  the  outcome 
of  programs  and  is  safe  (e.g.  the  non-deterministic  or- 
der in  which  a  lock  is  granted),  access  anomalies  often 
introduce  erternal  non-determinism  that  modifies  exe- 
cution results  Although  some  programs  are  designed 
to  contain  access  anomalies,  this  type  of  external  non- 
ileterminism  is  usually  unintentional.  Traditional  de- 
bugging techniques  are  of  limited  use  in  finding  errors 


caused  by  access  anomalies  because  such  bugs  are  sen- 
sitive to  timing. 

Detecting  access  anomalies  by  monitoring  program 
execution  is  proposed  in  [12,  10,  11,  4],  Using  this 
approach,  access  anomaly  are  detected  much  in  the 
same  manner  that  array  subscript  range  checking  is  per- 
formed. When  a  variable  is  accessed  during  execution, 
an  immediate  check  is  make  to  see  whether  the  access 
conflicts  with  a  previous  access,  in  which  case  the  error 
is  reported.  This  approach  may  be  used  in  conjunction 
with  static  analysis  [5,  2]  and  is  much  more  efficient  that 
trace-based  post-mortem  methods  [1,  3,  8]. 

Thus  paper  describes  a  new  on-the-fly  access 
anomaly  algorithm  called  task  recycling1  and  compares 
it  to  an  existing  method.  English- Hebrew  labeling  [9,  10]. 
These  methods  were  implemented  in  the  parallel  For- 
tran compiler  on  the  NYU  Ultracomputer  [6].  Both  al- 
gorithms record  read  and  write  events  in  access  histories 
that  are  associated  with  monitored  variables.  While  in 
the  worst  case  the  length  of  an  access  history  depends 
on  the  amount  of  parallelism  in  the  program,  experi- 
mental results  show  that  access  histories  in  fact  rarely 
have  more  than  a  few  entries2. 

The  two  algorithms  differ  in  the  methods  used 
to  determine  whether  different  threads  are  concurrent. 
The  experiment  shows  that  task  recycling  is  less  effi- 
cient for  maintaining  concurrency  information  at  paral- 
lel operations  than  English-Hebrew  labeling.  However, 
it  is  more  efficient  when  checking  for  anomalies  at  read 
and  write  events  and  requires  considerably  less  storage. 
Since  read  and  write  events  are  generally  more  frequent 
than  parallel  operations,  we  conclude  that  task  recy- 
cling is  an  improvement  over  English-Hebrew  labeling 
both  in  performance  and  space. 

Finally,  the  experiment  measures  the  actual  cost  of 
monitoring.  For  the  benchmark  programs,  monitoring 
entails  a  3-fold  to  6-fold  slowdown.  Although  this  cost 
is  high,  it  is  not  unreasonable  during  a  debugging  phase 
of  program  development.  Moreover,  this  is  the  cost  of 
monitoring  every  shared  variable  reference.  In  general, 
static  analysis  or  user  directives  may  be  used  to  limit 
the  number  of  variables  monitored,  and  thus  greatly  re- 
duce the  cost.  Section  2  states  the  anomaly  detection 
problem  in  terms  of  the  partial  order  execution  graph 
and  describes  access  histories.  The  two  algorithms  for 
determining  concurrency  relationships  are  presented  in 
Sections  3  and  4.  The  experiment  is  presented  in  Sec- 
tion 5,  and  the  results  summarized  in  Section  6. 


This  algorithm  relies  of  version  numbers  and  parent  vectors, 
which  are  similarly  used  in  [3]  for  post-mortem  analysis. 

Access  history  based  methods  are  therefore  superior  to  the 
method  described  in  [ll]  in  which  the  size  of  the  access  data 
depends  on  the  parallelism 


2      Definitions  and  Framework 

The  Ultracomputer  parallel  Fortran  compiler  provides 
the  nestable  doall-endall  construct  as  its  primary  con- 
currency primitive.  Concurrent  threads  are  created  by 
a  doall  operation  and  terminated  by  a  corresponding 
endall  operation.  Coordination  primitives  provided  as 
library  routines  may  be  classified  as  synchronous  or 
asynchronous3.  If  two  threads  coordinate  synchronously 
(e.g.  via  a  barrier  operation)  neither  thread  may  ex- 
ecute beyond  the  coordination  point  until  both  have 
reached  it.  If  two  threads  coordinate  asynchronously 
(e.g.  via  a  signal  operation)  the  receiver  may  not  exe- 
cute beyond  the  coordination  point  until  the  sender  has 
reached  it;  however,  the  sender  may  proceed  immedi- 
ately. A  block  is  an  instruction  sequence,  executed  by  a 
single  thread,  that  does  not  include  doall,  endall  or  co- 
ordination operations.  Hence  a  thread  is  made  up  of  a 
sequence  of  blocks  that  reflect  its  interaction  with  other 
threads. 

The  concurrency  relationship  among  blocks  is  rep- 
resented by  a  directed  partial  order  execution  graph, 
(POEG).  A  POEG  captures  Lamport's  "happens  be- 
fore'' relation  and  imposes  a  partial  order  on  the  set 
of  blocks  that  make  up  an  execution  instance  [7].  An 
edge  is  either  a  block  or  coordination  edge;  a  coordina- 
tion edge  connects  vertices  of  two  coordinating  blocks. 
A  vertex  corresponds  to  a  doall,  endall  or  coordination 
operation.  Concurrency  determination  is  independent 
of  the  number  and  relative  execution  speeds  of  proces- 
sors executing  the  program.  A  block  6;  is  an  ancestor 
of  a  block  b(  if  there  is  a  path  from  bj  to  6;  in  the  graph 
(equivalently,  6,  is  a  descendant  of  bj).  Two  blocks  are 
concurrent  iff  neither  is  an  ancestor  of  the  other.  We 
define  the  maximum  concurrency  of  a  POEG  to  be  the 
maximum  number  of  mutually  concurrent  blocks.  To  il- 
lustrate these  definitions,  consider  the  POEG  in  Figure 
2.  Block  63  is  concurrent  with  blocks  64,65  and  62;  it 
is  not  concurrent  with  block  60  and  b\  (which  are  an- 
cestors) or  with  blocks  b$  -  613  (which  are  descendants) 
The  maximum  concurrency  of  the  graph  is  four 

In  access  history  based  algorithms,  each  block  has 
an  associated  label.  The  access  htstory  for  a  variable  X 
is  a  set  of  labels  of  the  blocks  which  have  read  and  writ- 
ten A".  Every  time  variable  X  is  read  or  written,  the 
access  history  is  examined  to  determine  whether  the  cur- 
rent event  conflicts  with  a  previous  one.  When  block  6 
reads  X .  it  must  be  determined  whether  6  is  concurrent 
with  the  writers  in  the  access  history  for  X .  When  block 
6  writes  A',  tests  are  performed  to  determine  whether  6 
is  concurrent  with  any  of  the  blocks  in  the  access  history 
of  X    Therefore,  the  efficiency  of  an  access  anomaly  al- 


We  assume  that  all  interthread  coordination  is  explicit;  we 
do  not  attempt  to  solve  the  problem  of  automatically  detecting 
coordination 
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Figure  2:  Partial  Order  Execution  Graph 


Figure  3:  A  Partial  Order  Execution  Graph  with  Access 
Anomalies 


gorithm  depends  primarily  on  how  quickly  the  test  of 
concurrency  can  be  made  and  how  many  entries  are  in 
the  access  history.  We  next  show  how  access  histories 
can  be  compressed. 

Consider  the  accesses  to  variable  X  in  the  POEG 
in  Figure  3.  The  write  of  X  in  69  conflicts  with  the 
reads  of  X  in  62  and  bg.  Suppose  X  is  read  in  bn,  be, 
and  63  in  that  order,  and  then  written  in  69.  After  the 
first  read,  the  access  history  of  X  contains  bn-  When  X 
is  next  read  in  b$,  b$  is  added  to  the  access  history  and 
62  is  deleted.  The  62  read  event  is  no  longer  needed: 
any  subsequent  write  event  that  conflicts  with  b$  also 
conflicts  with  63.  On  the  other  hand,  when  63  is  added 
to  the  access  history,  b^  cannot  be  deleted.  Otherwise, 
no  anomaly  can  be  detected  when  X  is  written  in  69, 
since  this  write  does  not  conflict  with  the  read  in  63. 

More  generally,  a  block  6  in  the  reader  set  of  a  vari- 
able X  can  be  deleted  when  a  descendant  of  6  accesses 
X.  Thus,  the  reader  set  of  an  access  history  contains 
two  blocks  only  if  they  are  concurrent.  (On  the  other 
hand,  since  two  concurrent  writes  always  conflict,  there 
is  at  most  one  writer  in  an  access  history.)  By  deleting 
obsolete  entries,  the  size  of  an  access  history  is  therefore 
bounded  by  the  maximum  concurrency  of  the  POEG. 
We  show  in  Section  5  that  this  bound  is,  in  fact,  much 
too  pessimistic.  Although  compaction  reduces  the  size 
of  the  reader  and  writer  sets,  certain  information  may 
be  lost.  In  particular,  if  there  are  multiple  anomalies 
involving  the  same  variable  some  of  them  may  not  be 
reported.  However,  at  least  one  anomaly  is  guaranteed 
to  be  reported  for  every  variable  which  is  accessed  in  an 


"unsafe"  manner. 

3      English-Hebrew  Labeling 

In  the  English-Hebrew  labeling  algorithm,  the  structure 
of  the  POEG  -  and  hence,  the  concurrency  relationship 
among  blocks  -  is  encoded  in  tags  [10,  9]  associated  with 
blocks  A  tag  consists  of  a  pair  of  labels:  an  English  la- 
bel E  and  a  Hebrew  label  H.  Conceptually,  the  English 
label  is  produced  by  performing  a  left-to-right  preorder 
numbering  of  the  POEG;  each  block  is  assigned  a  num- 
ber less  than  all  of  the  numbers  assigned  to  its  children 
and  its  siblings  to  the  right.  However,  since  this  label 
must  be  generated  on-line,  a  complete  traversal  of  the 
execution  graph  cannot  be  performed.  Therefore,  a  la- 
bel is  a  string  of  numbers  and  labels  are  lexicographically 
ordered.  The  children  blocks  Co  .  .  cm  of  a  doall  vertex 
with  parent  block  p  are  assigned  English  labels: 

doall"   E(tagc%)  <—  E{tagp)  \  i 

where  |  is  the  append  operation.  Similarly,  a  block  c 
created  in  a  coordination  operation  which  has  parent  p 
is  assigned  an  English  label: 

coordination:    E(tagc)  •—  E(tagp)  |  1 

The  child  block  c  of  an  endall  vertex  with  parents  po 
.  .  .  pm  is  assigned  English  label: 

endall     E(tagc)  —  max(E(tagPi)) 

The  Hebrew  labels  are  created  symmetricly  for  a  nglit- 
to-left  ordering;  the  only  difference  is  that  the  i,fl  child 
of  a  doall  operation  has  /  +  1  —  «  as  the  last  entry  in  its 
Hebrew  tag.  As  specified  above,  the  length  of  the  labels 


increase  with  the  number  of  doall  and  coordination  op- 
erations. However,  an  additional  heuristic  described  in 
[10]  bounds  the  label  length  to  the  level  of  nesting. 

Two  tags  tagi  and  tagj  are  unordered  iff  either  of 
the  following  conditions  is  met  : 

E(tag,)  <  E(tagj)  and  H{tagi)  >  H(tagj)  or 
E(tagi)  >  E(tagj)  and  H(tag,)  <  H(tagj) 

English-Hebrew  tags  only  encode  the  ordering  due  to 
doall  and  endall  operations;  if  two  blocks  are  ordered  in 
the  POEG  because  of  explicit  coordination,  their  tags 
are  unordered.  Therefore,  an  additional  mechanism  is 
needed  to  record  execution  orderings  imposed  by  coor- 
dination. To  this  end,  a  coordination  list  is  associated 
with  each  executing  block  b  and  consists  of  the  tags  of 
the  ancestors  of  6  such  that  all  tags  are  unordered.  The 
test  for  concurrency  between  two  blocks  6  and  c  requires 
determining  dtagt,  or  any  of  the  tags  in  list),  are  ordered 
with  tagc.  Thus,  the  total  amount  of  work  is  bounded 
by  the  size  of  the  coordination  lists  times  the  length  of 
the  tags. 

The  length  of  a  coordination  list  is  bounded  by  the 
maximum  concurrency  of  the  POEG.  When  a  block  ter- 
minated, its  coordination  list  is  deleted  so  that  the  to- 
tal number  of  coordination  lists  is  also  bounded  by  the 
maximum  concurrency  of  the  POEG.  If  tags  are  stored 
directly  in  the  access  history,  storage  can  potentially  be 
large  and  there  is  a  larger  constant  overhead  because 
tags  are  variable  length.  On  the  other  hand,  if  tags 
are  stored  indirectly  (i.e.  if  access  histories  consist  of 
pointers  to  tags),  tags  must  be  saved  throughout  the 
execution  requiring  storage  proportional  to  the  number 
of  blocks. 

Figure  4  illustrates  the  use  of  English-Hebrew 
labels  for  a  POEG.  The  coordination  list  of  block 
1131, 1231  contains  the  tag  121,  113  because  of  the  coor- 
dination edge  in  the  POEG.  Likewise,  blocks  1211,  1131 
and  123,  113  have  coordination  lists  containing  the  tag 
113, 123.  However,  block  123,  123  does  not  have  a  co- 
ordination list  because  all  of  the  tags  in  the  coordina- 
tion lists  of  its  parent  blocks  are  ordered  with  123,123. 
We  can  determine  that  blocks  111,123  and  12,11  are 
concurrent  because  their  tags  are  unordered;  namely, 
the  first  condition  is  satisfied:  111  <  12  and  123  >  11. 
However,  blocks  1131,1231  and  12,11  are  not  concurrent 
because  an  entry  in  the  concurrency  list  of  1131,1231  - 
121,113-  is  ordered  with  12,11. 

4      Task  Recycling  Algorithm 

Task  Recycling  reduces  the  cost  of  testing  whether  two 
blocks  are  concurrent  to  an  array  reference  while  in- 
creasing the  expense  of  concurrency  information  main- 
tenance. Instead  of  a  tag,  each  block  has  a  unique  task 
identifier,  which  consists  of  a  task  and  a  version  num- 
ber. Tasks  can  be  recycled;  that  is,  more  than  one  block 
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Figure  4:  English-Hebrew  Labeling 


can  be  assigned  to  the  same  task  at  different  times  in 
the  program  execution.  The  version  number  of  a  task 
identifier  is  used  to  distinguish  among  different  blocks 
assigned  to  the  same  task;  every  time  a  block  is  as- 
signed to  a  task  t,  the  associated  version  number  v  is 
incremented.  The  label  storage  problem  is  therefore  ad- 
dressed because  the  number  of  labels  is  bounded  and 
they  are  easy  to  store  directly. 

Concurrency  information  is  maintained  in  a  parent 
vector  which  is  associated  with  each  currently  executing 
block4.  The  number  of  entries  in  a  parent  vector  is  equal 
to  the  number  of  tasks.  The  tth  entry  in  the  parent 
vector  for  block  b  contains  the  largest  version  number 
associated  with  those  ancestors  of  6  which  were  assigned 
to  task  t.  New  parent  vectors  are  formed  from  those 
vectors  associated  with  parent  blocks.  In  particular, 
when  blocks  pi  . .  .  pm  with  task  identifiers  tlVl  ■■■tmVnx 
create  a  new  block  c  the  parent  vector  of  c  is  initialized 
as  follows: 
for  i  =  1  to  T  do 


if  3, 


i  =  t}  then 


Jl,€<tl        lm} 

parent  r[i]  •—  v; 
else  parentc[i\  «—  max(parenrPl  [i], 
endfor 


,parentPm[i]) 


A  block  6  is  concurrent  with  a  block  with  task  identifier 
t„  iff  parenti,[t]  <  v.  Thus,  there  is  a  constant  cost  (an 
array  access)  for  checking  whether  two  blocks  are  con- 


4  Parent  vectors  correspond  to  before  vectors  in  [3,  8].  However, 
because  we  monitor  on-line,  we  do  not  need  the  corresponding 
after  vectors  used  for  post-mortem  trace  analysis. 


current.  (Coordination  lists,  on  the  other  hand,  must  be 
searched  linearly).  As  with  English-Hebrew  labels,  only 
currently  executing  blocks  need  parent  vectors;  storing 
a  task  identifier  in  an  access  history  is  sufficient  to  deter- 
mine whether  two  blocks  are  concurrent.  Once  a  block 
terminates,  its  parent  vector  is  discarded. 

Figure  5  shows  a  task  assignment  and  parent  vec- 
tors for  a  POEG.  In  this  example,  block  I3  is  concurrent 


Task  Ids 

Parent  Vector 

u 

[0,0,0,0,0,0] 

l2,2i 

[1,0,0,0,0,0] 

13,31,4! 

[2,0,0,0,0,0] 

2t,  5i,  61 

[1,1,0,0,0,0] 

14 

[3,0,1,1,0,0] 

15,23 

[4,2,1,1,0,0] 

24 

[4,3,1,1,1,1] 

16 

[5,4,1,1,1,1] 

Figure  5:  Task  Recycling  Assignment 

with  block  2i  since  the  second  entry  in  the  parent  vector 
of  13  -  [2,0,0,0,0,0]  -  is  less  than  1.  However,  block  15  is 
not  concurrent  with  2i  because  the  second  entry  in  its 
parent  vector  -  [4,2,1,1,0,0]  -  is  greater  than  1. 

A  block  6  is  validly  assigned  to  a  task  t  iff  it  is 
not  concurrent  with  any  other  block  previously  assigned 
to  t.  The  goal  of  task  assignment  is  to  minimize  the 
number  of  tasks  used  while  maintaining  a  valid  task 
assignment  since  this  determines  the  length  of  parent 
vector.  Our  task  assignment  algorithm  -  from  which 
this  method  derives  its  name  -  is  based  on  a  "most  re- 
cently used"  (MRU)  heuristic  and  yields  a  good  task  as- 
signment; there  is  no  on-line  task  assignment  algorithm 
which  guarantees  an  optimal  assignment  [4]. 

To  implement  the  MRU  algorithm,  a  free  task  dag 


is  maintained.  There  is  a  vertex  in  the  dag  associated 
with  every  doall,  endall  and  coordination  operation,  and 
every  currently  executing  block.  (These  latter  vertices 
are  always  leaf  vertices).  The  direction  of  edges  in  the 
dag  is  the  reverse  of  the  execution  flow.  A  set  of  free 
tasks,  i.e.  those  tasks  that  are  available  for  recycling,  is 
associated  with  each  vertex.  When  a  block  6  terminates 
at  a  doall  or  endall  operation  O,  it  adds  its  task  to  the 
free  task  list  of  the  vertex  vo  associated  with  O.  When  a 
block  terminates  in  a  coordination  operation,  its  task  is 
added  to  vo  and  the  children  its  associated  block  t»j  are 
added  to  the  child  list  of  vo-  When  a  block  6  is  created 
after  operation  O,  an  edge  is  added  from  the  vertex  vt, 
associated  with  6  to  vertex  vo-  To  assign  a  task  to  a 
block  b,  a  topological  traversal  of  the  dag  is  performed 
starting  at  t'j  until  a  vertex  with  a  non-empty  free  task 
set  is  found. 

Two  optimizations  minimize  the  cost  of  this  topo- 
logical traversal.  Whenever  a  free  task  set  of  a  vertex  v 
becomes  empty,  the  dag  is  collapsed  by  deleting  v  from 
the  dag  and  adding  edges  from  the  parent  vertices  of  v 
to  children  vertices  off.  Therefore,  every  vertex  in  the 
free  task  dag  has  at  least  one  free  task  associated  with 
it.  Whenever  a  vertex  v  has  a  single  parent  vertex  p,  1 
is  subsumed  by  p  by  adding  the  free  task  set  of  t;  to  the 
free  task  set  of  p,  deleting  v  and  adjusting  the  edges  as 
appropriate. 

Although  the  cost  of  performing  task  assignment  is 
potentially  quite  high,  for  a  large  class  of  programs  it  is 
very  limited.  In  particular,  for  programs  with  no  nested 
parallelism  and  programs  with  nested  parallelism  but  no 
coordination  a  constant  amount  of  work  is  performed  in 
assigning  a  task  to  a  block. 

5      Empirical  Results 

The  preceding  section  presents  two  algorithms  for  de- 
tecting access  anomalies  in  parallel  programs.  In  gen- 
eral, English-Hebrew  labeling  will  perform  better  than 
task  recycling  for  programs  with  little  coordination  and 
frequent  fine-grained  doall  operations  because  of  the  rel- 
atively high  cost  of  maintaining  parent  vectors  and  task 
assignment.  However,  task  recycling  is  clearly  more  effi- 
cient than  English-Hebrew  labeling  in  one  way:  the  cost 
per  variable  access  is  less.  Because  this  is  frequently 
the  most  common  operation,  this  is  an  important  at- 
tribute of  task  recycling.  Additionally,  there  is  almost 
always  a  large  storage  penalty  for  saving  tags  in  the 
English-Hebrew  labeling  algorithm.  Moreover,  if  th^rf 
is  frequent  coordination,  the  cost  of  maintaining  coor- 
dination lists  approaches  the  cost  of  maintaining  parent 
vectors,  so  that  any  advantage  of  English-Hebrew  label- 
ing is  diminished. 

General  discussions  of  algorithm  design,  however 
give  incomplete  insight  into  the  costs  encountered  when 


detecting  anomalies  in  actual  parallel  programs.  In  or- 
der to  better  evaluate  the  two  algorithms  -  as  well  as  to 
gain  insight  into  actual  concurrency  structure  and  mem- 
ory access  patterns  -  we  monitored  several  benchmark 
scientific  parallel  programs: 
Triso  -  Solves  a  sparse  triangular  linear  system 

of  equations  using  wavefront  parallelism 
Finite         -  Solves  a  linear  system  using  finite 

element  methods 
Simple         -  Solves  partial  differential  equations  for 

hydrodynamics  and  heat  conduction 
Polymer     -  Performs  molecular  dynamic  calculations 
of  polymer  systems 

Our  first  goal  in  performing  these  benchmarks  is  to 
obtain  representative  values  for  parameters  that  mea- 
sure concurrency  structure  and  shared  variable  access 
patterns  for  parallel  program.  This  gives  us  a  general 
idea  about  the  applicability  of  various  monitoring  tech- 
niques. Our  second  goal  is  to  measure  the  actual  perfor- 
mance impact  of  monitoring  parallel  programs.  English- 
Hebrew  and  task  recycling  are  compared  on  both  their 
execution  time  and  space  requirements.  The  implemen- 
tation and  benchmark  results  are  discussed  in  the  fol- 
lowing sections. 

5.1      Implementation 

The  two  monitoring  algorithms  are  implemented  for  the 
Ultracomputer  parallel  Fortran  compiler.  The  access 
anomaly  detection  algorithms  are  implemented  by  two 
|  library  packages  mostly  written  in  C.  The  routines  for 
maintaining  access  histories  are  written  in  assembler  for 
reasons  of  efficiency.  A  front-end  pre-processor  allocates 
the  storage  for  access  histories  and  inserts  calls  to  the 
run-time  libraries  in  the  program  being  monitored.  This 
implementation  is  relatively  portable  to  other  parallel 
Fortran  system.  The  pre-processor  approach  also  sim- 
plifies the  implementation;  however,  the  resulting  moni- 
toring efficiency  is  worse  than  if  the  compiler  were  mod- 
ified. 

Several  optimizations  were  made  to  task  recycling 
and  English-Hebrew  algorithms  during  implementation: 

•  For  every  monitored  variable  there  is  a  "mirror" 
variable  which  contains  its  access  history.  Thus. 
when  a  block  reads  A[i],  the  access  history  in 
<4a/,[i]  is  checked  for  anomalies  and  then  updated. 
The  access  history  lengths  are  fixed  because  of  16 
megabyte  memory  constraints.  While  this  limita- 
tion may  result  in  undetected  anomalies,  we  believe 
that  anomalies  will  be  missed  only  rarely,  since  the 
measurements  discussed  in  Section  5.3  indicate  that 
the  size  of  the  reader  set  is  generally  very  small. 
Each  entry  in  an  access  history  contains  a  task  iden- 
tifier or  a  pointer  to  the  English-Hebrew  label  and 
the  line  number  and  a  pointer  to  the  function  name 
of  the  last  access;  this  enables  us  to  identify  the  lo- 


cations in  the  program  code  where  access  anomalies 
occur. 

•  English-Hebrew  labels  consist  of  a  string  of  two 
byte  integers  and  a  length  field.  They  are  stored 
indirectly  in  access  histories  and  coordination  lists 
This  decreases  the  size  and  complexity  of  access  his- 
tories. There  is  no  attempt  made  to  reclaim  space 
for  labels  no  longer  referenced.  We  feel  the  addi- 
tional expense  of  maintaining  reference  counts  out- 
weighed the  benefits  of  possibly  decreasing  storage 
requirements. 

•  Blocks  can  share  parent  vectors  (in  task  recycling) 
or  coordination  lists  (in  English-Hebrew  labeling)  if 
they  have  identical  sets  of  ancestors.  For  example, 
all  blocks  created  by  a  doall  operation  share  the 
same  ancestors;  one  of  these  blocks  must  obtain  a 
private  parent  vector  or  coordination  list  only  if  it 
terminates  before  the  endall  (e.g.  by  performing 
a  nested  doall  or  coordination  operation).  For  a 
nested  doall  operation  with  outer  parallelism  of  m 
and  inner  parallelism  of  n,  this  reduces  the  number 
of  parent  vectors  and  coordination  lists  by  a  factor 
of  nm.  5 

•  Both  algorithms  benefit  from  a  "run-until- 
completed"  scheduling  paradigm,  which  is  the  com- 
mon scheduling  method  used  in  the  Ultracomputer 
and  many  other  parallel  Fortran  environments. 
Parallel  scientific  codes  typically  exhibit  a  high 
degree  of  nominal  parallelism.  In  the  run-until- 
completed  model,  each  block  created  in  a  doall  will 
not  block  until  it  terminates  at  its  associated  en- 
dall operation.  This  limits  the  number  of  blocks 
which  can  ever  run  concurrently  to  n,  the  underly- 
ing parallelism  of  the  machine.  This  is  not  a  strong 
optimization  for  English-Hebrew  labeling,  since  the 
majority  of  its  space  requirement  is  for  block  labels; 
it  is  more  effective  for  task  recycling  since  parent 
vector  storage  is  more  significant. 

•  In  the  Ultracomputer  parallel  Fortran  environment, 
each  of  n  actual  processes  perform  the  work  of  sev- 
eral parallel  blocks.  The  differences  among  the  par- 
ent vectors  of  these  blocks  are  very  small.  There- 
fore, in  order  to  reduce  the  work  performed  in  main- 
taining parent  vectors,  a  private  "template"  parent 
vector  is  cached  for  each  level  of  nesting.  When  a 
block  b  updates  its  parent  vector,  the  process  which 
is  performing  the  work  of  6  also  records  the  modi- 
fications When  block  6  terminates  it  must  update 
the  parent  vectors  of  its  children  blocks.  Instead 
of  comparing  all  of  the  entries  in  its  parent  vec- 
tor, it  simply  compares  the  recorded  modifications. 


Mf  local  memory  is  available  and  the  cost  of  accessing  shared 
memory  is  much  higher  than  local  memory,  sharing  of  data  struc- 
tures may  not  be  cost-effective. 


Likewise,  when  a  new  block  is  created  the  template 
parent  vector  must  be  reinitialized.  The  modifica- 
tion records  are  used  to  back  out  the  changes  made 
rather  than  reinitializing  all  of  the  entries. 
For  programs  with  only  doall  and  endall  operation, 
the  per  block  cost  is  reduced  from  the  maximum 
concurrency  of  the  graph  to  the  maximum  level  of 
nesting  (the  number  of  times  that  a  change  must 
be  either  added  or  backed  out  of  a  parent  vector) 
and  the  amount  of  space  and  work  per  process  is 
the  product  of  the  maximum  concurrency  and  the 
level  of  nesting.  Because  coordination  lists  would 
have  to  be  traversed  to  perform  the  updates,  there 
is  no  corresponding  optimizations  for  the  English- 
Hebrew  labeling  scheme. 

5.2      Concurrency  Structure 

For  all  four  benchmark  parallel  programs,  the  concur- 
rency structure  is  quite  simple:  there  is  very  limited 
nesting  of  doall  constructs  and  minimal  synchroniza- 
tion. However  the  degree  and  granularity  of  parallelism 
vary  considerably. 

•  Triso  has  coarse  granularity  parallelism  with  lim- 
ited synchronization.  It  consists  of  a  single  doall 
operation  which  creates  8  parallel  threads;  these 
subsequently  perform  two  dorrier  synchronization 
operations. 

•  Simple  has  medium  granularity  parallelism  with 
some  locking.  It  performs  10  doall  operations  that 
each  create  124  parallel  threads,  and  130  operations 
that  create  from  10  to  30  threads.  In  addition,  dur- 
ing 10  phases  of  execution  approximately  half  of  the 
30  concurrent  threads  obtain  a  lock  (represented  by 
an  asynchronous  coordination  edge  in  the  POEG). 

•  Finite  exhibits  a  large  degree  of  fine  granularity 
parallelism  and  does  not  use  coordination  opera- 
tion. It  performs  60  doall  operations  that  each  cre- 
ate 1000  parallel  threads;  50  operations  that  create 
250  threads  and  200  operations  that  create  between 
2  and  32  threads.  Each  block  performs  a  very  lim- 
ited amount  of  computation;  in  many  case  a  block 
consists  of  a  single  operation  on  an  array  element. 

•  Polymer  also  exhibits  a  large  degree  of  fine  granu- 
larity parallelism  and  does  not  use  coordination  op- 
eration; however,  it  has  one  level  of  nested  doall  op- 
erations (the  first  three  benchmark  programs  have 
no  nested  doall  operations).  It  performs  40  nested 
doall  operations;  the  outer  operations  create  1000 
parallel  threads  each  of  which  creates  3  parallel  sub- 
threads.  In  addition,  it  performs  20  doall  opera- 
tions which  create  350  parallel  threads  and  10  doall 
operations  which  create  100  parallel  threads. 

The  last  two  programs  have  such  fine  granularity  paral- 
lelism that  monitoring  them  primarily  measures  the  cost 


of  maintaining  concurrency  information  rather  than  the 
cost  of  checking  accesses  to  shared  variables.  Table  1 
summaries  the  concurrency  parameters  for  the  bench- 
mark programs.  The  experimental  results  presented  in 


Total  # 

Max 

Average 

Program 

Blocks 

Concur 

Concur 

Triso 

24 

8 

8 

Simple 

4090 

124 

28 

Finite 

74,900 

1,000 

245 

Polymer 

128,000 

3,000 

1,800 

Table  1:  Concurrency  Structure 

Sections  5.4  and  5.5  show  that  the  concurrency  struc- 
ture of  the  program  has  a  significant  impact  on  the  cost 
of  maintaining  concurrency  information.  However,  the 
coneurrency  structure  does  not  significantly  impact  the 
cost  per  variable  access,  as  discussed  in  the  next  section 

5.3      Shared  Variable  Access  Patterns 

The  work  and  space  required  for  maintaining  access  his- 
tories is  proportional  to  the  average  size  of  the  reader 
set  of  an  access  history.  While  theoretically  the  size 
of  the  reader  set  may  grow  to  the  maximum  concur- 
rency of  the  graph,  in  practice  the  size  is  much  smaller. 
Many  parallel  scientific  codes  distribute  workloads  by 
partitioning  data  among  concurrent  threads.  Hence  a 
thread  often  shares  data  with  a  neighboring  thread,  but 
seldom  shares  data  with  all  other  threads.  Therefore, 
one  would  expect  the  number  of  concurrent  readers  to 
be  limited. 

The  shared  memory  access  patterns  measured  for 
the  benchmark  programs  support  this  conclusion  and 
are  shown  in  Table  2.  Each  column  gives  the  percentage 
of  accessed6  shared  variables  with  at  most  the  specified 
number  of  concurrent  readers.  In  the  Triso  program,  for 
example,  15%  of  the  shared  variables  have  two  concur- 
rent readers  at  some  point  during  execution,  but  never 
have  more  than  two. 

The  last  column  of  Table  2  shows  the  average  max- 
imum reader  set  size  for  all  variables  with  less  than  10 
concurrent  readers' .  For  the  first  three  benchmark  pro- 
grams, this  includes  virtually  all  shared  variables  and  is 
an  upper  bound  on  the  size  of  the  reader  set.  Since  for 
the  Polymer  program  10%  of  the  variables  have  more 
than  9  concurrent  readers,  the  figure  in  the  last  column 
only  reflects  statistics  for  90%  of  its  shared  variables.  If 
we  are  conservative  and  assume  that  all  of  these  vari- 
ables are  accessed  by  all  3,000  concurrent  block,  the 
average  reader  set  size  is  279  concurrent  readers  (which 


Because  fixed  maximum  sized  arrays  are  used,  many  shared 
variables  are  never  accessed. 

7  It  is  not  possible  to  measure  reader  sets  larger  than  10  due  to 
memory  constraints  on  the  Ultracomputer 


Program 

Percentage  of  Variables  with  Reader  Set  Sizes 

Ave 
Size 

1 

2 

3 

4 

5 

6 

i 

8 

9 

>9 

Triso 
Simple 

Finite 
Polymer 

80.2% 
49.3% 
48.4% 
72.9% 

15.0% 
41.0% 
32.5% 

9.4% 

4.8% 
9.6% 
4.6% 
2.6%, 

0.0% 
t 
t 

1.1% 

0.0% 
0.0% 

t 

2.8% 

0.0%. 
0.0% 
0.9% 

0  *7< 

0.0% 
0.0% 

t 

2.6% 

0.0% 
0.0% 

t 

6.2% 

0.0% 

0.0% 

13.4% 

2.0% 

0.0% 
t 
t 

9.3% 

1.25 
1.69 
2.71 
1.31 

t  Less  than  0.1%  of  the  variables  have  reader  sets  of  this  size 

Table  2:  Reader  Set  Sizes 


is  still  much  less  than  3,000).  If  we  are  slightly  less 
conservative  and  assume  that  all  of  these  variables  are 
accessed  by  the  average  number  of  concurrent  blocks, 
the  average  reader  set  size  is  168. 

Table  2  reveals  two  somewhat  surprising  results: 

1.  The  size  of  reader  sets  tends  to  be  very  small.  In 
all  four  programs,  more  than  80%  of  the  variables 
are  never  read  by  more  than  two  concurrent  readers 
and  almost  50%  are  never  read  concurrently.  More- 
over, a  variable  with  maximum  measured  reader  set 
size  of  n  may  actually  have  a  much  smaller  reader 
set  size  throughout  most  of  the  execution  of  the 
program.  Hence  our  estimate  of  the  average  reader 
set  size  is  pessimistic  with  respect  to  the  actual  av- 
erage reader  set  size. 

2.  There  appears  to  be  little  correlation  between  the 
average  reader  set  size  and  the  degree  of  parallelism 
of  the  program.  For  example,  the  Triso  and  Poly- 
mer programs  have  fairly  similar  access  patterns  for 
the  majority  of  their  variables  (excepting  the  9.3% 
in  the  Polymer  program  with  more  than  9  concur- 
rent readers).  Triso,  however,  has  a  very  low  degree 
of  parallelism  while  Polymer  has  nested  parallelism 
of  a  very  high  degree. 

Since  the  average  reader  set  size  is  so  much  smaller  than 
their  worst  case  bounds,  we  conclude  that  the  access 
history  based  algorithms  are  preferable  to  that  of  [11] 
in  which  the  storage  depends  on  the  maximum  concur- 
rency of  the  POEG. 

5.4      Space  Requirements 

For  the  results  described  in  the  following  two  sections 
four  versions  of  each  program  were  executed: 

Unmonitor        -  unmonitored  program 
Monitor(l)        -  monitors  every  shared  variable 

using  a  reader  set  size  of  one 
Momtor(2)        -  monitors  every  shared  variable 

using  a  reader  set  size  of  two 
Concurrency      -  maintains  concurrency  information 

only;  no  variables  are  monitored 

Table  3  compares  the  memory  requirements  for  each 
of  the  four  versions  of  the  benchmark  programs  listed 
above.    The  static  size  is  the  size  of  the  object  mod- 


Program 

Static  Stz 

es 

Dynamic 

Unm 

Mon(l) 

Mon(2) 

TR 

A-// 

Triso 

163 

660 

1,000 

2 

8 

Simple 

314 

921 

1,270 

6 

80 

Finite 

230 

745 

1,020 

29 

954 

Polymer 

639 

2,898 

4.461 

279 

7.000  t 

t  Estimated  Value 

Table  3:  Space  Requirements  (in  Kbytes) 


ule  which  includes  additional  monitoring  code  as  well 
as  storage  needed  for  access  histories.  (This  is  virtu- 
ally the  same  for  both  algorithms).  The  dynamic  size 
is  the  amount  of  storage  needed  for  concurrency  infor- 
mation, which  is  allocated  at  run-time.  As  is  seen  in 
Table  3,  English-Hebrew  labeling  requires  substantially 
more  space  than  task  recycling  for  maintaining  concur- 
rency information.  In  fact,  it  is  not  possible  to  obtain 
the  actual  dynamic  memory  requirements  of  Polymer 
for  the  English-Hebrew  labeling  scheme;  the  amount  of 
storage  needed  for  concurrency  information  exceeds  the 
capacity  of  the  Ultracomputer  and  hence  an  estimate 
was  calculated  based  on  the  concurrency  structure  of 
the  POEG. 

5.5      Execution  Times 

Table  4  displays  user  mode  execution  times  for  the  four 
benchmark  programs  using  both  of  the  anomaly  detec- 
tion algorithms.   Concurrency  isolates  the  cost  of  mam- 


Task  Recyc 

ling 

Program 

Unm 

Cone 

Mon(l) 

Mont?) 

Triso 

3.5 

4  1 

19.6 

22 : 

Simple 

202 

263 

550 

598 

Finite 

98 

734 

959 

1108 

Polymer 

789 

3330 

4367 

4607 

English- Hebrew  t 

Program 

Unm 

Cone 

Mon(l) 

Mon(2) 

Triso 

3.5 

4.7 

20.9 

24  4 

Simple 

202 

237 

579 

676 

Finite 

98 

407 

704 

1D72 

fPolymer  was  not  morutored  due  to  memory  constraints 


Table  4:  Total  Execution  Time  (in  seconds) 


taining  the  concurrency  information  from  the  overall 
cost  of  detecting  access  anomalies.  This  cost  is  sub- 
stantial for  the  programs  with  very  high  degrees  of  fine- 
grained parallelism.  In  the  current  implementation  of 
the  task  assignment  algorithm,  assigning  and  freeing  a 
task  requires  locking  vertices  in  the  free  task  dag,  which 
creates  a  serial  bottleneck.  If  free  tasks  are  stored  in  par- 
allel access  data  structures  (e.g.  parallel  access  queues), 
it  may  be  possible  to  decrease  this  serialization  effect. 

For  the  Triso  and  Simple  programs,  the  English- 
Hebrew  labeling  algorithm  requires  more  time  than  task 
recycling  for  monitoring  variables  (as  is  shown  in  the 
run  times  for  Monitor(l)  and  Monitor(2)),  even  when 
the  cost  of  maintaining  concurrency  information  is  less 
(as  in  shown  in  the  run  times  for  Concurrency).  The 
primary  cost  in  the  Polymer  and  Finite  programs  is 
the  maintenance  of  concurrency  information;  this  is  not 
surprising  given  the  fine  granularity  of  the  parallelism 
in  the  two  programs  For  both  algorithms,  the  overall 
increase  in  compulation  time  is  too  high  for  transpar- 
ent monitoring  of  all  shared  references.  Nevertheless, 
we  not  believe  the  cost  it  too  prohibitive  for  on-the-fly 
anomaly  detection  to  be  a  useful  debugging  tool.  More- 
over, static  analysis  and/or  a  compiler-based  implemen- 
tation should  make  it  more  efficient. 

Note  that  the  time  measurements  presented  are  the 
total  execution  times  of  all  concurrent  threads  rather 
than  elapsed  running  time.  The  maintenance  of  con- 
currency information  results  in  a  57%  -  160%  increase 
in  elapsed  running  time,  while  monitoring  all  accesses 
to  all  shared  variables  with  reader  sets  of  size  one  incurs 
a  160%  to  460%  increase  in  execution  times.  The  differ- 
ence between  the  elapsed  and  execution  time  increases 
stems  from  Amdahl's  law:  almost  all  of  the  additional 
work  is  performed  in  parallel. 

From  Section  5.3  we  know  that  using  two  entry 
reader  sets  instead  of  one  entry  reader  sets  significantly 
increases  the  percentage  of  variables  guaranteed  to  have 
at  least  one  anomaly  detected  (from  50%  to  80%).  A 
comparison  of  the  execution  times  for  Monttor(l)  and 
Monitor(2)  in  Table  4  shows  that  we  may  do  so  with 
little  additional  cost.  Table  5  isolates  the  cost  of  main- 
taining access  histories  and  shows  the  percentage  in- 
crease when  using  reader  sets  of  size  two  instead  of  one. 
The  isolated  times  are  computed  as  the  total  monitored 


Program 

Mon(l) 

Mon(2) 

Increase 

TR 

E-H 

TR 

E-H 

TR 

E-H 

Triso 
Simple 
Finite 

16.4 
287 
225 

16.3 
242 

297 

21  0 
335 
354 

22.1 

339 

685 

20% 
17% 
57% 

22% 

40% 

130% 

Table  5:  Access  History  Update  Time  (in  seconds) 
execution  time  (  Monttor(l)  and  Momtor(2))  less  the  ex- 


ecution time  when  simply  maintaining  concurrency  in- 
formation {Concurrency).  These  relative  time  increases 
indicate  that  as  reader  sets  grow  in  size,  the  cost  of 
detecting  anomalies  increases  more  rapidly  for  English- 
Hebrew  labeling  than  for  task  recycling.  Therefore,  as 
larger  reader  sets  are  used  the  cost  of  the  more  complex 
task  concurrency  verification  in  the  English-Hebrew  al- 
gorithm becomes  increasingly  more  significant. 

6      Concluding  Remarks 

This  paper  presents  the  problem  of  detecting  access 
anomalies  in  parallel  programs  under  a  models  of  con- 
currency which  incorporates  nested  parallel  loops  and 
synchronous  and  asynchronous  coordination.  An  exist- 
ing on-the-fly  algorithm  -  English-Hebrew  labeling  -  is 
described  and  compared  with  a  new  algorithm  based 
on  task  assignment  to  a  partial  order  execution  graph. 
Experimental  data  from  monitoring  four  benchmark  sci- 
entific programs  using  the  two  algorithms  indicate  four 
important  results: 

1  The  benchmark  programs  use  data  partitioning  so 
extensively  that  over  80%  of  all  variables  never  have 
more  than  two  concurrent  readers.  Therefore,  the 
size  of  access  histories  appears  to  be  independent 
of  the  degree  of  parallelism  within  the  program. 

2.  The  task  recycling  algorithm  requires  much  less 
storage  for  concurrency  information  than  English- 
Hebrew  labeling. 

3.  Because  of  its  efficient  concurrency  information 
management,  English-Hebrew  labeling  performs 
better  on  programs  with  frequent  doall  operations 
and  relatively  few  accesses  to  shared  variable. 

4.  However,  as  reader  sets  or  concurrency  lists  in- 
crease in  size  the  high  cost  of  performing  concur- 
rency verification  in  English-Hebrew  labeling  out- 
weighs the  cost  of  parent  vector  management  in  the 
task  recycling  algorithm. 

If  the  benchmark  programs  are  indicative  of  a  wide  class 
of  parallel  programs,  the  task  recycling  algorithm  is  an 
important  improvement  over  existing  technology. 

Acknowledgements 

We  would  like  to  thank  Marc  Snir  and  Larry 
Rudolph  for  their  introduction  to  this  problem  and  Ron 
Cytron,  Allan  Gottlieb,  Bud  Mishra  and  Ed  Schonberg 
for  their  many  helpful  suggestions. 

References 

[1]  Todd  R.  Allen  and  David  A.  Padua.  Debugging 
Fortran  on  a  Shared  Memory  Machine.  In  Pro- 
ceedings of  the  International  Conference  on  Paral- 
lel Processing,  pages  721-717,  Aug  1987. 


[2]  David  Callahan  and  Jaspai  Subhlok.  Static  Anal- 
ysis of  Low  Level  Synchronization.  In  Proceedings 
on  the  SIGPLAN  Workshop  on  Parallel  and  Dis- 
tributed Debugging,  pages  100-111,  May  1988. 

[3]  Jong-Deok  Choi,  Barton  P.  Miller,  and  Robert  Net- 
zer.  Techniques  for  Debugging  Parallel  Programs 
with  Flowback  Analysis.  Technical  Report,  Univer- 
sity of  Wisconson,  Aug  1988. 

[4]  Anne  Dinning  and  Edith  Schonberg.  An  Evalua- 
tion of  Monitoring  Algorithms  for  Access  Anomaly 
Detection.  Technical  Report  Ultracomputer  Note 
#163,  New  York  University,  July  1989. 

[5]  Perry  A.  Emrath  and  David  A.  Padua.  Automatic 
Detection  of  Nondeterminancy  in  Parallel  Pro- 
grams. In  Proceedings  on  the  SIGPLAN  Workshop 
on  Parallel  and  Distributed  Debugging,  pages  89- 
99,  May  1988. 

[6]  Allan  Gottlieb.  An  Overview  of  the  NYU  Ultra- 
computer  Project.  In  J.J.  Dongarra,  editor,  Exper- 
imental Parallel  Computing  Architectures,  pages  25 
-  95,  Elsevier,  1988. 

[7]  Leslie  Lamport.  Time,  clocks,  and  the  ordering  of 
events  in  a  distributed  system.  Communications  of 
the  ACM,  21(7),  Jul  1978. 

[8]  Barton  P.  Miller  and  Jong-Deok  Choi.  A  Mecha- 
nism for  Efficient  Debugging  of  Parallel  Programs. 
In  Proceedings  on  the  SIGPLAN  Workshop  on  Par- 
allel and  Distributed  Debugging,  May  1988. 

[9]  Itzhak  Nudler  and  Larry  Rudolph.  Indeterminancy 
Considered  Harmful.  1988. 
[10]  Itzhak  Nudler  and  Larry  Rudolph.  Tools  for  the  Ef- 
ficient Development  of  Efficient  Parallel  Programs. 
In  1st  Israeli  Conference  on  Computer  System  En- 
gineering, 1988. 

[11]  Edith  Schonberg.  On-The-Fly  Detection  of  Access 
Anomalies.  In  Proceedings  on  the  SIGPLAN  Con- 
ference on  Programming  Language  Design  and  Im- 
plementation, Jun  1989. 

[12]  Marc  Snir.  Private  correspondence.   1988. 


Variations  of  this  result  date  back  at  least  to  1966  and  proofs  are  given  in  a  number 
of  papers;  see  e.g.  Bramble  [12],  Bramble,  Pasciak  and  Schatz  [15]  or  Yserentant  [58]. 

Since  all  the  elements  of  V^  vanish  at  the  vertices  of  the  substructures,  we  must 
choose  the  interpolant  iH^h  as  the  element  in  VH  when,  as  required  by  Lemma  1,  we 
define  the  representation  of  u/, .  It  is  easy  to  show  that  l-0/ufcl  m/Q  can  be  estimated 
from  above  by 

(17)  Yl  (uh(xk)-uh(x,))2, 

fc,/ev, 

where  the  V,  is  the  set  of  vertices  of  the  substructure  Q, .  It  then  follows  from  Lemma 
2  that 

(18)  l^fclffitflo  - const-^  +  ^(H/h))\uh\2Hlint)  ■ 

For  a  bound  on  the  other  terms  in  the  decomposition,  required  when  using  Lemma  1, 
see  Dryja  and  Widlund  [28]. 

Using  the  method  with  local  preconditioners  developed  in  the  end  of  Section  3,  a 
number  of  algorithms  can  be  derived  from  the  basic  method  that  we  have  just  intro- 
duced. All  that  is  needed  is  to  replace  the  Dirichlet  problem  on  Q,-j  by  a  preconditioner 
with  a  symmetric,  positive  definite  coefficient  matrix.  A  proof  of  the  main  result  of 
Bramble,  Pasciak  and  Schatz  [15]  can  be  obtained  in  this  way  by  using  (14),  (15), 
Theorem  2  and  a  bound  for  the  condition  number  of  the  two  subregion  problem  de- 
fined on  fi,j.  Such  local  bounds  are  given  e.g.  in  Bjorstad  and  Widlund  [9],  Bramble, 
Pasciak  and  Schatz  [18]  and  Dryja  [22],  [23]. 

In  our  previous  paper,  only  equation  (2)  was  considered.  By  using  the  formulas 
(7),  (8),  and  (9),  the  proof  can  be  extended  immediately  to  equation  (3);  the  whole  ma- 
chinery can  also  be  used  locally.  In  this  variant  of  the  proof,  only  four  subspaces  play  a 
role,  namely,  the  three-dimensional  space  of  linear  functions  on  fi,,  and  the  restriction 
to  Q,  of  the  spaces  V-j  which  correspond  to  the  three  edges  of  that  substructure. 

In  preparation  for  the  three-dimensional  case,  we  develop  an  alternative  basic 
algorithm  for  the  two-dimensional  problems.  This  algorithm  is  defined  in  terms  of 
subspaces  related  to  individual  substructures.  For  any  substructure  fi,-,  we  keep  the 
three  local  subspaces  obtained  as  above  from  the  spaces  V-j.  For  an  interior  substruc- 
ture, we  replace  the  three-dimensional  space  of  linear  functions,  used  previously,  by 
the  three-dimensional  space  spanned  by 

(19)  0i°  =  v*  -  4°  Ew  +  tf\  k  e  vt. 

lev, 

Here  <pk  is  the  standard  nodal  basis  function  associated  with  an  element  of  Vt.  We  note 
that,  just  as  the  regular  nodal  basis  functions,  ^[''(xi)  =  6u,  k,l  6  V^.  The  positive 
constants  0j*'  are  chosen  below,  so  that 


Then, 


£  W  =  i. 

fcev, 


Jtev, 

n 


This  subspace,  thus,  also  contains  the  constants,  i.e.  the  null  space  of  the  elliptic 
operator  restricted  to  fi,  with  a  Neumann  boundary  condition.  (When  these  methods 
are  extended  to  elasticity  problems,  we  similarly  have  to  include  the  whole  null  space  of 
that  operator  in  the  space  that  accounts  for  the  global  transportation  of  information.) 
For  a  substructure  that  intersects  the  boundary,  we  simply  use  the  space  spanned 
by  the  standard  basis  functions  <pk  corresponding  to  the  vertices  of  the  substructures 
where  a  Dirichlet  condition  is  not  imposed.  For  the  correct  definition  of  boundary 
substructures  for  the  case  of  three  dimensions;  see  below. 

An  elementary  computation  shows  that,  for  an  interior  substructure,  the  quadratic 
form  associated  with  this  subspace  is  given  by 

(20)  E42k(^)-40)2- 

kev, 

Here,  u]^  is  the  weighted  average  of  u^(xfc),  k  £  V,,  with  the  weights  @k  .  It  is  easy  to 
see  that  this  quadratic  form  can  be  bounded  from  above  and  below  by  the  form  (17) 
given  in  the  discussion  of  the  previous  algorithm. 

Inspired  by  Mandel  [36],  we  now  introduce  the  quadratic  form 

(2D         j(Uh,y)  =  E  E  *!*(«*(**)  -  y{i))2  -2EE  «*(**)*i?- 

«   fcev,  .    kev, 

For  substructures  which  have  at  least  one  vertex  on  3ft,  we  set  yW  =  0.  In  three 
dimensions,  we  adopt  the  same  rule  for  all  substructures  which  have  at  least  an  edge 
on  dQ.,  but  treat  those  which  only  touch  the  boundary  at  individual  points  as  if  they 
were  interior. 

The  vector  6^)  corresponds  to  the  part  of  the  load  vector  which  is  associated  with 
0,.  Its  components  are  computed  by  evaluating  an  appropriate  functional  at  ipk  ■ 


It  is  easy  to  show  that  if 


fcW 

j(0  _         Kkk 


2^i€V.  Kll 

then, 

t 

j(fl    _    „„,„,;„    V"    L-^l 


(22)  4"  =  argmin  £  kk'>(uh(xk)  -  y^f 

fcev, 


y(o 


Therefore,  the  solution  of  the  linear  system  corresponding  to  these  subspaces  can  be 
obtained  by  minimizing  J  with  respect  to  Uh  and  y.  By  setting  the  gradient  of  J 
equal  to  zero,  we  obtain  a  linear  system  of  equations,  which  naturally  can  be  written 
in  a  two-by-two  block  form.  The  diagonal  blocks  are  diagonal  with  elements  J2i^k'k 
and  ^Ztgv  kjfcjt?  respectively.  The  off-diagonal  blocks  are  sparse  and  are  also  given  by 

fc(,) 

Kkk- 

Since  the  diagonal  blocks  are  diagonal,  it  is  easy  to  eliminate  all  the  variables 
Uh(xk)-  This  elimination  step  results  in  a  sparse  system,  similar  to  that  of  a  finite 
difference  problem  on  a  coarse  mesh  that  is  dual  to  the  one  used  to  define  the  space 
VH .  It  is  often  feasible  and  economical  to  solve  this  system  by  a  direct  method.  Once 
the  values  of  u^  are  known,  it  is  easy  to  determine  the  values  of  Uh(xk),  k  6  V,,  and  the 
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corresponding  element  of  the  subspace.  The  contributions  of  the  other  subspaces  can 
now  be  computed,  while  observing  that  we  must  make  the  sum  of  all  the  contributions 
continuous  across  the  interfaces  between  the  substructures. 

Remark.  The  method  just  outlined  draws  heavily  on  the  work  of  Bramble,  Pasciak 
and  Schatz  [16].  However,  the  current  algorithm  uses  a  different  average  and  basis. 
Our  theory  does  not  require  that  the  linear  system,  which  determines  the  averages,  to 
have  a  M— matrix.  Our  method  can  therefore  be  extended  to  the  elliptic  systems  of 
elasticity;  cf.  also  Mandel  [36]. 

It  can  be  shown,  straightforwardly,  that  there  is  no  longer  a  need  to  insist  on 
triangular  substructures.  In  the  general  case,  we  can  simply  redefine  V,  as  the  set  of 
nodal  points  which  belong  to  the  boundaries  of  at  least  three  substructures. 

Lemma  2  holds  only  in  two  dimensions;  it  resembles  a  Sobolev  inequality  which 
is  far  from  valid  in  three  dimensions.  However,  for  finite  element  functions,  we  can 
find  a  useful  bound  of  the  Li  —  norm  over  an  interval  in  terms  of  the  strain  energy. 
Let  W,  denote  the  wire  basket  of  the  substructure  ft,.  This  is  the  union  of  the  edges 
of  the  tetrahedral  substructure;  more  generally,  it  is  defined  as  the  set  of  nodal  points 
of  fi;,  which  belong  to  the  boundaries  of  at  least  three  substructures.  The  following 
lemma  is  essentially  a  corollary  of  Lemma  2. 

LEMMA  3.  Let  a  be  any  convex  combination  of  values  of  Uh(x),x  6  W,.  Then 


h\Wh  -  a||?j(w.)  ^  consL  C1  +  lo&(H/h))\uh\2Hi{cit 


An  algorithm  for  the  three-dimensional  case  can  now  be  defined  in  terms  of  the 
subspaces  V^,  which  are  defined  just  as  in  the  two-dimensional  case,  and  a  special 
subspace.  This  subspace  is  given  in  terms  of  a  basis  defined  as  in  (19),  where  we 
replace  the  set  V,-  by  W,.  The  study  of  the  linear  algebra  problem  associated  with 
this  basis  proceeds  very  much  as  in  the  two-dimensional  case.  We  only  note  that 
when  we  compute  the  expression  corresponding  to  (20),  we  also  get  off-diagonal  terms 
corresponding  to  nodes  on  W;,  which  are  next  neighbors.  It  is  however  easy  to  show 
that  these  contributions  can  be  disregarded.  The  rest  of  the  analysis  and  the  devel- 
opment now  follows  just  as  in  the  two-dimensional  case.  We  note  that  the  number  of 
elements  in  W,  grows  linearly  with  H/h.  The  nontrivial  part  of  the  sparse  linear  sys- 
tem of  the  special  subspace  has  a  dimension,  which  is  equal  to  the  number  of  interior 
substructures. 

We  have  obtained  the  following  result. 

Theorem  3.  The  operator  P  of  the  additive  algorithm  defined  by  the  spaces  Vfi 
and  that  defined  by  the  basis  functions  given  by  (19),  or  the  corresponding  formula  for 
the  three-dimensional  case,  satisfies  the  estimate  k(P)  <  const.(l  +  log(H/h))2  for 
equations  (2)  and  (3).  The  constant  is  independent  of  h,  H  and  the  discontinuities  of 
the  coefficient  p{x). 

6.  Schwarz  Methods  for  Problems  on  the  Interfaces.  We  have  already  seen 
that  the  original  linear  systems  of  equations  can  be  reduced  to  smaller  systems,  where 
the  remaining  variables  are  associated  with  the  boundaries  dVtx  of  the  substructures. 
We  have  also  shown  that  this  reduced  set  of  variables  are  associated  with  the  piece- 
wise  discrete  harmonic  functions.  This  is  quite  similar  to  potential  theory  for  elliptic 
problems  where  there  is  also  a  reduction  in  dimension. 
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Here,  we  only  consider  the  first  model  case  (2),  since  at  this  time,  we  do  not 
know  if  bounds  independent  of  the  variations  of  the  coefficients  of  equation  (3)  can 
be  obtained  for  the  methods  that  we  are  going  to  introduce.  We  denote  by  V£arm(T) 
the  space  of  piecewise  discrete  harmonic  functions  defined  by  the  values  of  the  finite 
element  functions  on  the  set 

(23)  r  =  [Jdnx\dn. 

The  space  of  traces  Hll2{T)  are  the  restrictions  of  H1^)  to  T.  As  is  well  known,  the 
norm  of  this  space  can  be  defined  by 

(24)  \u\w/*(r)  =  Y,\u\2Hi/i(dn,y 

i 

where 

(25)  M^an.)  =  J     JdQ  (!«(*(*))  "  u(x(t))\2/\x(s)  -  x(t)\d)dsdt. 

Here,  d  =  2,3,  for  problems  with  fi  C  Rd. 

We  consider  several  different  additive  Schwarz  methods.  The  first  is  defined  by 
the  partitioning 

(26)  V&rm(H  =  X>Wm,.-, 

t 

where  V£arm  t  is  the  subspace  ofV£arm(T)  with  zero  values  on  T  outside  dfi,.  By  using 
the  same  technique  as  in  the  previous  three  sections,  it  is  easy  to  see  that  the  operator 
P.  which  correspond  to  this  set  of  subspaces,  is  bounded  uniformly  from  above.  We  can 
use  the  techniques  of  Widlund  [54]  and  Dryja  [24]  for  the  two-  and  three-dimensional 
cases,  respectively  and  Lemma  1,  to  obtain  the  bound  C$  <  const. (1  +  log(H/h))2.  It 
is  not  possible  to  obtain  a  uniform  bound,  since  then  the  spaces  H0Q  and  H1'2  would 
have  to  be  the  same:  cf.  Lions  and  Magenes  [33]. 

The  computations  associated  with  the  space  V£armt  involve  piecewise  discrete 
harmonic  functions,  which  differ  from  zero  in  all  the  substructures  which  are  next 
neighbors  of  ft,-.  The  coefficient  matrix  of  the  linear  system,  which  is  solved  when 
computing  the  projection  of  V£arm  onto  V£arm _,-  is  the  principal  minor  of  5  associated 
with  the  nodes  on  <9fi,.  It  also  involves  contributions  from  the  matrices  S^'  of  the 
neighbors  of  Q,  .  It  is  natural  to  replace  this  matrix  by  S^'\  since  the  use  of  this 
preconditioner  only  involves  a  problem  on  one  substructure. 

What  we  have  just  given  is  an  alternative  derivation  of  an  algorithm  introduced 
by  Bourgat,  Glowinski,  Le  Tallec  and  Vidrascu  [11].  We  note  that  there  is  a  minor 
technical  issue  related  to  the  fact  that  5^  is  singular  for  interior  substructures,  but 
that  it  is  easy  to  solve  that  problem.  Using  the  formalism  developed  in  Section  2,  we 
can  write  down  the  following  formula  for  the  preconditioner  which  corresponds  to  this 
algorithm, 

(27)  45"'!/B  =  E^Vr1!/?. 

Using  Lemma  3.2  in  Widlund  [54],  the  corresponding  estimates  developed  for  the 
three-dimensional  case  in  Dryja  [24]  and  the  estimate  (16),  we  can  conclude  the  proof 
of  the  following  result. 
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THEOREM  4.  The  operator  P  of  the  additive  algorithm  defined  by  the  spaces 
Vharm,i,  applied  to  equation  (2),  and  the  local  preconditioner  defined  by  S^  satisfies 
the  estimate  k(P)  <  const.(l+\og(H  /  h))4 .  The  constant  in  the  estimate  is  independent 
of  h  but  not  of  H. 

By  using  an  elementary  argument  given  in  Widlund  [54],  it  is  easy  to  see  that  k(P) 
must  grow  at  least  as  fast  as  l/H2,  since  in  each  iteration,  information  is  exchanged 
only  with  neighboring  substructures.  The  performance  of  the  algorithm,  for  the  case 
of  many  substructures,  can  be  improved  by  introducing  the  same  spaces  of  modest 
dimension,  which  provide  some  global  transportation  of  information  in  the  methods 
considered  in  Section  5.  After  this  modification,  the  constant  in  the  theorem  can  be 
shown  to  be  independent  of  H  as  well. 

We  note  that  in  each  iteration  of  this  algorithm,  both  a  Neumann  and  a  Dirichlet 
problem  have  to  be  solved  on  each  substructure.  This  results  in  twice  as  much  work 
per  iteration  as  for  many  other  domain  decomposition  methods. 

If  there  is  a  red-black  ordering  of  the  substructures,  there  is  another  way  of  an- 
alyzing this  algorithm.  We  first  consider  the  two  subdomain  case.  We  note  that  the 
inverse  of  the  preconditioner  corresponding  to  the  Neumann-Dirichlet  algorithm  is 
S'1)  or  S'2)  depending  on  the  roles  assigned  to  the  substructures;  cf.  Bj0rstad 
and  Widlund  [9].  We  can  therefore  view  the  preconditioner  introduced  in  this  section 
as  an  average  of  Neumann-Dirichlet  operators.  In  this  simple  case,  the  condition  num- 
ber remains  uniformly  bounded  since  there  are  no  so-called  cross  points.  Similarly,  we 
can  use  a  result  by  Dryja  [24]  to  prove  that  the  algorithm  for  three  dimensions,  which 
incorporates  the  global  subspace  introduced  in  the  previous  section,  has  a  condition 
number  which  is  bounded  by  const. (l  +  log(H/h))3.  This  again  follows  from  the  obser- 
vation that  this  new  preconditioner  can  be  viewed  as  an  average  of  two  instances  of  the 
operator  which  arises  in  the  study  of  the  version  of  the  Neumann-Dirichlet  algorithm 
considered  in  Dryja  [24]. 

We  conclude  this  section  by  a  brief  discussion  of  some  recent  work  by  Barry  Smith, 
a  Courant  Institute  graduate  student.  His  algorithm,  described  in  Smith  [48],  is  an 
additive  Schwarz  method,  which  uses  the  subspace  VH  both  in  two  and  three  dimen- 
sions. We  give  some  details  only  for  the  three-dimensional  case.  In  addition  to  VH, 
there  are  subspaces  associated  with  each  face,  edge  and  vertex  of  the  substructures. 
The  subspace  associated  with  a  face  is  H0q  (Fij)f)V£arm  and  therefore  closely  related 
to  the  space  Vq  introduced  in  Section  5.  Similarly,  for  an  edge  (vertex),  Smith  uses 
a  space  associated  with  a  neighborhood  in  T  of  the  edge  (vertex)  which  extends  a  dis- 
tance on  the  order  of  the  diameter  of  Q,.  The  success  of  this  approach  can  be  explained 
informally  by  noting  that  the  overlap  is  relatively  generous.  The  numerical  results, 
so  far  confined  to  two  dimensions  and  simple  substructures  with  benign  aspect  ratios, 
show  great  promise.  In  addition  to  Poisson's  equation,  membrane  and  shell  models 
have  been  tested.  So  far,  the  condition  numbers  have  always  been  less  than  4. 

The  following  theoretical  result  has  been  established  for  both  two  and  three  di- 
mensions. It  also  holds  for  elliptic  systems  such  as  those  of  linear  elasticity;  cf.  Smith 

[48]. 

THEOREM  5.  The  operator  P  of  the  additive  algorithm  developed  by  Smith  satisfies 
the  estimate  k(P)  <  const.  The  constant  in  the  estimate  is  independent  of  h  as  well 


s  H. 

15 


7.  Optimal  Iterative  Refinement  Methods.  In  our  study  of  iterative  refine- 
ment methods,  we  consider  problems  on  a  special  kind  of  composite  finite  element 
triangulations.  We  begin  by  introducing  a  relatively  coarse  triangulation  of  ft'1'  =  ft, 
and  denote  the  corresponding  space  of  finite  element  functions  by  V'*1 .  We  can  think 
of  this  space  as  having  a  uniform  (or  relatively  uniform)  mesh  size  h\.  Let  ft'2) 
be  a  subregion  where  we  wish  to  increase  the  resolution.  We  do  so  by  subdividing 
the  elements  and  introducing  an  additional  finite  element  space  Vh7.  This  space  is 
constructed  quite  similarly  to  the  previous  one  and  it  contains  Vhl  0-^6(^2)  as  a 
subspace.  We  assure  that  the  resulting  composite  space  Vhl  +  Vh2  is  conforming  by 
having  the  functions  of  Vh2  vanish  on  5ft'2'.  We  repeat  this  process  by  selecting  a 
subregion  ft'3'  of  ft'2'  and  introducing  a  further  refinement  of  the  mesh  and  the  finite 
element  space,  etc.  We  denote  the  resulting  nested  subregions  and  subspaces  by  ft'*' 
and  Vhx,  respectively.  Throughout,  we  assume  that 

ft'1'  C  ft*'"1'  , 
and  that 

v*-1  n  tf^ft''')  c  vh>  c  ^(ft(,)),     i  =  2, . . ., k. 

The  composite  finite  element  space,  on  the  repeatedly  refined  mesh,  is 

yh   _   yh!   +  yhi  +      _  _  +  yhk 

The  finite  element  models  on  composite  meshes  is  thus  systematically  constructed 
by  introducing  a  basic  finite  element  approximation  on  the  entire  region  and  then 
selecting  subregions,  and  subregions  of  subregions  etc.,  where  the  finite  element  model 
is  further  refined  in  order  to  gain  higher  accuracy.  In  each  iteration  of  the  iterative 
refinement  methods,  problems  representing  finite  element  models  on  the  original  region 
and  the  subregions,  prior  to  further  refinement,  are  solved.  As  always,  an  additive 
algorithm  is  defined  by  specifying  the  subspaces  V,-,  or  alternatively,  the  projections 
Pi. 

In  order  to  prove  the  results  given  at  the  end  of  this  section,  we  need  some  ad- 
ditional technical  assumptions.  To  make  our  proofs  work,  we  cannot  allow  the  sets 
ft,_i  \  ft,  to  become  arbitrarily  thin  in  comparison  with  the  diameter  of  fti_i.  We  also 
assume  that  the  area  of  any  triangle  on  level  i  can  be  bounded  by  const.  ql~i  times 
the  area  of  the  triangle  on  level  j  of  which  it  is  a  part.  Here  q  is  a  constant  <  1  and 
j  <  i. 

The  fundamental  building  blocks  of  our  algorithms  are  the  projections  P-  ,  j  = 
i  —  l,i,  onto  the  spaces  Vh'  nffo(0,).  We  note  that  if  j  =  i  —  1,  we  solve  a  problem  on 
ft,  with  a  coarser  mesh  than  if  Vh>  were  used.  The  projection  Pf  is  defined  in  terms 
of  the  unique  element  of  Vh>  l~l  //o(ft,),  which  satisfies 

(28)  a{Plvh,4>h)  =  a(vh,<f>h),  V^  G  Vh>  n^(ft,). 

We  consider  two  different  algorithms  and  distinguish  between  them  by  using  su- 
perscripts. The  perhaps  most  natural  algorithm  uses  the  projections  P^  =  P*.  The 
condition  number  of  this  algorithm  can  grow  as  fast  as  linearly  with  k.  By  the  stan- 
dard argument,  it  is  easy  to  show  that  the  eigenvalues  of  P'1'  are  bounded  from  above 
by  k.  This  bound  is  attained  if  Vhi  D  H^(Qk)  is  not  empty,  i.e.  when  the  mesh  size  hi 
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is  fine  enough.  Any  function  in  this  space  belongs  to  Vh* ,  i  =  1,2, . . .,  Jfc,  and  is  exactly 
reproduced  by  each  of  the  projection  operators.  It  is  therefore  an  eigenfunction  of  P^ 
with  the  eigenvalue  k.  Similarly,  any  function  which  belongs  to  Vhl  D  #0(^1  \  ^2) 
is  an  eigenfunction  with  eigenvalue  1.  We  have  shown  in  [56],  that  the  eigenvalues  of 
P^1)  are  bounded  from  below  by  a  constant.  The  condition  number  of  P^  is  therefore 
of  order  k. 

We  are  principally  interested  in  the  additive  algorithm  defined  by  the  projections 
Pt-  =  P-  -  P/+1  ,  i  <  k  —  1,  and  PJ^  '  —  P£.  It  easy  to  show  that  these  operators  are 
projections  and  that  the  composite  finite  element  space  Vh  is  the  direct  sum  of  the 
corresponding  subspaces  V,-  '.  The  following  results  have  recently  been  demonstrated 
in  Dryja  and  Widlund  [27]  and  Widlund  [56].  The  latter  paper  also  contains  a  similar 
result  for  a  multiplicative  algorithm. 

THEOREM  6.  The  condition  number  of  PW  is  uniformly  bounded  by  a  constant. 
The  condition  number  of  P^  grows  at  most  linearly  with  the  number  of  refinement 
levels. 

We  note,  that  Tarek  Mathew  [40]  has  developed  iterative  refinement  methods 
for  Raviart-Thomas  finite  element  methods  and  obtained  bounds  for  their  rate  of 
convergence. 
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ABSTRACT 

A  distinction  is  often  made  between  domain  decomposition  algorithms  for  elliptic  partial 
differential  equations  which  use  overlapping  subregions  and  those  which  do  not. 
Schwarz's  alternating  method,  the  oldest  of  them  all,  belongs  to  the  first  category.  It  has 
recently  been  discovered  that  many  of  the  algorithms  that  belong  to  the  second  category 
can  also  be  regarded  as  generalizations  of  the  classical  Schwarz  method  or  an  additive 
variant  thereof.  This  new  approach  provides  new  tools  for  the  analysis  and  development 
of  domain  decomposition  algorithms.  In  this  paper,  we  introduce  an  abstract  additive 
Schwarz  method  and  develop  a  simple  framework  for  the  analysis  of  its  rate  of  conver- 
gence. We  show  that  it  is  possible  and  convenient  to  specify  algorithms  in  terms  of  a  set 
of  subspaces  and  related  orthogonal  projections.  This  family  of  algorithms  is  further  ex- 
panded by  replacing  the  linear  systems  of  equations,  which  correspond  to  the  projec- 
tions, by  suitable  preconditioned. 

We  illustrate  the  usefulness  of  this  approach  by  considering  additive  Schwarz  methods  of 
a  relatively  conventional  type,  iterative  substructuring  methods,  domain  decomposition 
methods  defined  on  the  curves  or  surfaces  which  subdivide  the  region,  and  iterative 
refinement  methods.  Throughout  wc  work  in  a  framework  of  conforming  finite  elements 
and  self-adjoint  problems,  but  we  also  mention  some  new  results  for  more  general  ellip- 
tic finite  clement  problems. 

Key  Words  Schwarz's  alternating  method,  domain  decomposition,  elliptic  equations, 
finite  elements. 
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1.  Introduction.  In  this  paper,  we  survey  our  recent  research  on  domain  de- 
composition and  related  algorithms  for  elliptic  finite  element  problems.  One  of  our 
aims  is  to  develop  a  set  of  common  and  powerful  tools  for  the  development  and  anal- 
ysis of  a  great  variety  of  methods.  We  have  recently  discovered  that  many  iterative 
substructuring  methods  can  be  viewed  as  so-called  additive  Schwarz  methods  and 
that  the  analysis  of  these  methods  can  be  simplified;  cf.  Dryja  and  Widlund  [28]. 
Here,  we  discuss  and  extend  this  result  and  provide  further  evidence  that  many  do- 
main decomposition  methods  fit  into  a  framework  provided  by  the  Schwarz  additive 
methods. 

We  note  that  the  bounds  given  for  the  condition  number  of  the  iteration  operators 
for  many  domain  decomposition  algorithms  considered  in  this  and  other  papers  are 
either  uniform  in  the  number  of  subdomains  and  subregions  or  grow  only  polynomially 
in  the  logarithm  of  the  number  of  degrees  of  freedom  associated  with  an  individual 
subregion.  The  algorithms  are  therefore,  in  a  certain  sense,  almost  optimal. 

The  paper  is  organized  as  follows.  After  introducing  two  elliptic  model  problems 
and  certain  finite  element  methods  in  Section  2,  we  begin  Section  3  by  reviewing 
Schwarz's  alternating  algorithm  in  its  classical  setting.  Following  Sobolev  [50]  and  P. 
L.  Lions  [34],  we  indicate  how  this  algorithm  can  be  expressed  in  a  variational  form. 
Since  this  formulation  is  very  convenient  for  the  analysis  of  finite  element  problems, 
we  work  in  this  Hilbert  space  setting  throughout  the  paper. 

In  Section  3,  we  also  introduce  the  additive  variant  of  Schwarz's  method;  cf  Dryja 
[21],  Dryja  and  Widlund  [26],  [28],  Matsokin  and  Nepomnyaschikh  [41]  and  Nepom- 
nyaschikh  [44].  We  present  it  in  a  general  form  and  show  how  algorithms  of  this  kind 
can  be  defined  in  terms  of  a  set  of  subspaces  and  projections.  An  additive  Schwarz 
method  can  be  viewed  as  an  iterative  method  for  the  solution  of  an  auxiliary  linear 
problem  that  has  the  same  solution  as  the  original  finite  element  problem.  We  also 
introduce  tools,  which  make  it  possible  to  estimate  the  condition  number  k(P)  of  the 
operator  P  of  this  new  problem. 

It  is  often  straightforward  to  obtain  an  upper  bound  for  the  spectrum  of  P.  A 
lower  bound  of  the  eigenvalues  is  obtained  in  terms  of  an  upper  bound  of  a  Rayleigh 
quotient  which  measures  the  extent  by  which  the  subspaces  are  linearly  independent. 
(If  the  subspaces  are  orthogonal,  the  Schwarz  method  converges  in  one  step.)  We 
show  that  preconditioners  can  be  used  as  approximate  solvers  for  the  linear  systems, 
which  are  related  to  the  individual  projections,  and  that  it  is  easy  to  obtain  bounds 
for  the  condition  number  of  the  resulting  algorithm  in  terms  of  the  spectrum  of  the 
basic  operator  and  any  bounds  for  the  individual  preconditioned  problems  that  might 
be  available. 

Section  3,  thus,  provides  us  with  a  basic  framework.  In  the  rest  of  the  paper,  we 
turn  to  a  series  of  applications. 

In  Section  4,  we  summarize  our  work  on  additive  Schwarz  methods  of  a  type 
directly  related  to  the  classical  case.  For  previous  discussions  of  this  algorithm,  see 
Dryja  [21],  Dryja  and  Widlund  [26],  [28],  with  the  most  details  given  in  [28].  We  note 
that  this  method  has  been  extended  successfully  to  certain  stationary  and  parabolic 
convection-diffusion  problems  in  the  dissertation  of  Xiao-Chuan  Cai  [20], [19],  and  to 
mixed  finite  element  methods  of  the  Raviart-Thomas  type,  cf.  [46],  in  the  dissertation 
of  Tarek  Mathew  [40].  In  this  paper,  we  are  unable  to  provide  details  about  their 
work  since  it  would  require  the  introduction  of  many  notations,  etc.  We  also  discuss 
a  result  on  multilevel  algorithms  obtained  in  collaboration  with  Xuejun  Zhang,  a 
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Courant  Institute  graduate  student. 

In  Section  5,  we  discuss  a  well-known  family  of  domain  decomposition  methods 
known  as  iterative  substructuring  methods;  cf.  e.g.  Agoshkov  [1],  Bjorstad  and  Hvid- 
sten  [6],  Bjorstad  and  Widlund  [8],  Bramble,  Pasciak  and  Schatz  [18],  [15],  [16],  [14], 
[17],  Dryja  [24],  Dryja,  Proskurowski  and  Widlund  [25],  Dryja  and  Widlund  [28],  Lebe- 
dev  [32],  Marchuk,  Kuznetsov  and  Matsokin  [39],  Quarteroni  [45],  Smith  and  Widlund 
[49]  and  Widlund  [54].  These  algorithms  are  based  on  a  non-overlapping  subdivision 
{fi,-  }  of  the  region  f2.  (  Borrowing  a  term  from  structural  engineering,  the  subregions 
are  often  called  substructures.)  Surprisingly  enough,  as  shown  already  in  Dryja  and 
Widlund  [28],  many  of  these  methods  also  fit  well  in  the  Schwarz  framework.  In  this 
paper,  we  further  extend  our  analysis  and  include  a  more  detailed  discussion  of  the 
three-dimensional  case.  In  this  study,  it  is  fruitful  to  take  a  substructure  by  sub- 
structure view,  estimating  the  contribution  to  the  strain  energy  from  an  individual 
substructure  in  terms  of  the  corresponding  local  contribution  to  the  preconditioner. 
This  approach  is  similar  to  that  of  Jan  Mandel,  and  his  coworkers  Babuska,  Craig  and 
Pitkaranta,  cf.  [2],  [37],  [36],  [35],  who  have  begun  a  systematic  study  of  domain  de- 
composition methods  for  p-version  finite  element  methods;  cf.  also  Bramble,  Pasciak 
and  Schatz  [17]  for  a  fundamental  study  of  h-version  methods.  This  approach  high- 
lights how  preconditioned  can  be  constructed  from  parts  which  are  strictly  local  with 
respect  to  an  individual  substructure,  parts  which  involve  interaction  between  pairs 
of  neighboring  substructures  and  a  coarse  global  model  with  relatively  few  degrees  of 
freedom.  Many  of  the  results  which  were  obtained  for  two  subregions  in  the  early  de- 
velopment of  the  theory  can  now  be  recycled  to  yield  useful  results  for  the  much  more 
interesting  case  of  many  substructures.  We  note  that  some  of  our  results  have  been 
extended  to  a  class  of  two-dimensional,  stationary  and  parabolic  convection-diffusion 
problems  by  Xiao-Chuan  Cai  [20],  [19]. 

In  Section  6,  we  consider  Schwarz  methods  on  the  lower  dimensional  manifolds 
formed  by  the  curves  or  surfaces,  which  partition  the  domain  into  substructures.  These 
algorithms  can  also  be  viewed  as  Schwarz  methods  for  problems  of  potential  theory. 
This  idea  has  previously  been  discussed  by  Nepomnyaschikh  [44]  for  the  case  of  a  few 
substructures  and  without  a  full  development  of  the  theory.  We  include  a  discussion 
and  estimates  of  the  rate  of  convergence  of  a  method  developed  by  Bourgat,  Glowinski, 
Le  Tallec  and  Vidrascu  [11]  and  also  discuss  some  recent  work  by  Barry  Smith,  a 
Courant  Institute  graduate  student;  cf.  [48]. 

In  Section  7,  we  survey  our  recent  work  on  iterative  refinement  methods.  These 
are  methods  for  the  solution  of  the  linear  systems  of  algebraic  equations,  which  arise 
from  elliptic  finite  element  problems  defined  on  composite  meshes.  The  study  of  these 
methods  was  pioneered  by  McCormick  and  his  coworkers  Hart  and  Thomas  [31],  [42], 
[43].  An  analysis  of  two-level  algorithms  is  given  in  Bramble,  Ewing,  Pasciak  and 
Schatz  [13]  and  in  Mandel  and  McCormick  [38].  Proofs  of  the  results  discussed  in  this 
section  are  given  in  Dryja  and  Widlund  [27]  and  Widlund  [56].  The  main  difficulties  in 
this  work  are  related  to  an  effort  of  obtaining  bounds  for  the  rate  of  convergence  which 
are  independent  of  the  number  of  refinement  levels  as  well  as  the  number  of  degrees 
of  freedom.  We  note  that  in  the  dissertation  of  Tarek  Mathew  [40],  certain  results  are 
obtained  on  iterative  refinement  methods  for  Ravi  art -Thomas  finite  elements. 

In  this  paper,  we  primarily  consider  algorithms  that  are  of  additive  Schwarz  type. 
We  note  that  a  variety  of  multiplicative  algorithms  also  can  be  defined  systematically. 
At  present,  the  general  multiplicative  case  appears  to  be  less  well  understood  than 
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the  additive,  except  in  the  case  of  two  subspaces.  For  the  two  subspace  case,  Bjorstad 
and  Mandel  [7]  have  recently  extended  their  earlier  work,  cf.  [5]  and  [38],  obtaining  a 
detailed  comparison  of  the  spectra  of  the  additive  and  multiplicative  algorithms.  We 
note  that  there  are  cases  for  which  a  satisfactory  theory  already  exists  in  the  case  of 
more  than  two  subspaces.  Thus  the  algorithm  of  Bank,  Dupont  and  Yserentant  [3]  is  a 
multiplicative  variant  of  the  original,  additive  algorithm  of  Yserentant's  [58];  it  is  just 
as  well  understood.  Similarly,  the  basic  theory  for  the  multilevel  iterative  refinement 
methods  has  advanced  to  the  same  level  in  the  two  cases;  cf.  Widlund  [56]  and  Dryja 
and  Widlund  [27].  We  note,  finally,  that  in  the  recent  thesis  of  Mathew  important 
progress  is  reported  for  the  general  multiplicative  case;  cf.  Mathew  [40]  and  a  remark 
in  Section  3. 

2.  Model  Problems  and  Finite  Elements.  In  this  section,  we  introduce  finite 
element  approximations  of  a  standard  Poisson  equation  and  a  special  second  order 
elliptic  problem  with  variable  coefficients. 

In  the  model  problems,  the  continuous  and  discrete  problems  are  of  the  form 

a(u,v)  =  f(v),    V  v  G    V  , 
and 

(1)  a(uh,vh)  =  f{vh),    Vvhe    Vh  , 

respectively.  We  consider  homogeneous  Dirichlet  problems  on  bounded  Lipschitz  re- 
gions in  two  or  three  dimensions  and  continuous,  piecewise  linear  finite  elements.  For 
the  Poisson  problem,  the  bilinear  form  is  defined  by 


(2)  a(u,v)=   I 

Jn 


Vu  •  Vt;  dx  . 


This  form  defines  a  semi-norm  |u|//i(£h  =  (a(u,u))1/2  in  the  Sobolev  space  Hl(?L).  It 
is  a  norm  of  V  =  Hq(CI).  Here  Hq(CI)  is  the  subspace  of  H1(Q.)  functions  with  zero 
trace;  aD  elements  of  V  and  its  subspace  Vh  vanish  on  dfi,  the  boundary  of  ft.  We 
note  that  if  ft  C  ft,  then  any  element  of  Hq(Q)  can  be  extended  by  zero  to  an  element 
in  i?o(ft),  and  that  the  extension  operator  is  continuous.  We  therefore  regard  Hq(CI) 
as  a  subspace  of  Hq(Q). 

Almost  all  our  results  can  be  extended  immediately  to  general  conforming  finite 
element  approximations  of  any  self-adjoint  elliptic  problem,  which  can  be  formulated 
as  a  minimization  problem,  and  to  more  general  boundary  conditions.  However,  some 
of  the  bounds  can  be  very  poor  if  there  is  a  great  variation  in  the  values  of  the 
coefficients.  In  this  paper,  we  consider  only  one  generalization, 

(3)  a(u,  v)  =   /   p(x)'Vu  •  Vv  dx  , 

Jq 

where  p(x)  >  0  can  be  discontinuous,  but  varies  slowly  in  each  subregion.  There  is 
experimental  evidence  that  some  domain  decomposition  methods  of  Neuman-Dirichlet 
type  perform  quite  poorly  for  certain  problems  of  the  form  (3);  cf.  Greenbaum,  Li 
and  Chao,  [30].  For  other  algorithms  a  satisfactory  theory  already  exists  with  bounds 
which  only  depend  on  the  local  variation  of  p(x);  see  Section  5.  For  still  others,  this 
issue  is  still  to  be  fully  addressed. 
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The  triangulation  of  Q  is  introduced  in  the  following  way.  The  region  is  first 
divided  into  non-overlapping  substructures  Q,  ,  i  =  1,---,7V.  We  assume  that  the 
substructures  are  chosen  so  that  the  discontinuities  of  p(x)  occur  only  at  substructure 
boundaries.  To  simplify  the  description,  we  confine  most  of  our  study  to  triangular 
(simplicial)  substructures.  In  such  a  case,  the  original  region  must,  of  course,  be  a 
polygon  (polyhedron).  We  note  that  the  rate  of  convergence  of  the  iterative  methods, 
which  are  considered  in  this  paper,  depend  only  very  mildly  on  the  size  of  the  sub- 
structures. The  size  of  the  subregions  can  therefore  be  selected  primarily  to  minimize 
the  cost  of  a  single  iteration  on  a  particular  computer  system. 

All  the  substructures  Q,  are  further  divided  into  elements.  The  common  assump- 
tion in  finite  element  theory  that  all  elements  are  shape  regular  is  adopted  and  the 
same  assumption  is  made  concerning  the  substructures.  On  the  element  level  this 
means  that  there  is  a  uniform  bound  on  /iA'/rA'  >  which  is  independent  of  the  number 
of  degrees  of  freedom.  Here  fix  is  the  diameter  of  the  element  K  and  r^  the  diameter 
of  the  largest  inscribed  sphere  in  K. 

Since  a(u^,  Vh)  =  a(u,  u/J,  V  vh.  6  V  \  the  finite  element  solution  is  the  projection 
of  the  exact  solution  onto  the  finite  element  space  with  respect  to  the  inner  product 
defined  by  the  bilinear  form.  The  problems  defined  on  the  subregions,  from  which  pre- 
conditioners  for  the  entire  problem  can  be  assembled,  can  often  similarly  be  viewed  in 
terms  of  orthogonal  projections  onto  subspaces  directly  associated  with  the  subregion 
in  question. 

We  also  need  to  use  matrix  representations  of  the  finite  element  problems  since 
many  algorithmic  details  can  best  be  described  using  matrix  notations.  When  doing 
this,  it  is  convenient  to  consider  domain  decomposition  in  light  of  structural  engi- 
neering computational  practices.    The  elements  of  the  stiffness  matrix  K  are  given 

by 

ki.m  =  a(w,<pm)  , 

where  <pi  and  v?m  are  standard  finite  element  basis  functions.  Since  an  integral  over  Q. 
can  be  written  as  a  sum  of  integrals  over  the  substructures,  the  stiffness  matrix  can 
be  assembled  from  the  stiffness  matrices  K^'\  which  have  the  elements 

fc!,m  =  flfi,(W»Vm)  • 

Here  the  relevant  /  and  m  correspond  to  degrees  of  freedom  associated  with  the  closure 
of  the  substructure  0.,.  The  bilinear  form  ctn,(t£/,,  v/,)  represents  the  contribution  to 
the  integral  an{uh,Vh)  from  the  substructure  fi,. 

This  so-called  subassembly  process  can  be  summarized  in  the  formula 

(4)  xTKy  =  J2*®TK®lPif 

where  x^  is  the  subvector  of  parameter  values  associated  with  fit. 

Remark.  In  practice,  the  subassembly  process  is  often  used  recursively,  creating 
larger  and  larger  so-called  super  elements.  This  process  can  be  interleaved  with  the 
elimination  of  all  the  variables  which,  at  the  level  in  question,  are  coupled  only  to 
other  variables  of  the  same  super  element;  cf.  e.g.  [4].  Except  for  a  brief  discussion  in 
Section  4,  we  consider  only  three  levels  in  this  paper:  the  elements  with  a  characteristic 
diameter  h,  the  substructures  with  a  diameter  on  the  order  of  H  and  the  entire  region 
f2,  which,  without  loss  of  generality,  is  assumed  to  have  unit  diameter. 
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If  we  divide  the  sub-vectors  x^  associated  with  the  i-th  substructure  into  two,  Xj 
and  Xg  ,  corresponding  to  the  variables  which  are  interior  to  the  substructure  and 
those  which  are  shared  with  other  substructures,  then  the  matrix  K^  can  be  written 
as 

/    E®      K(t) 


II  1YIB 
(i)T  tA>) 
IB        nBB 


Since  the  interior  variables  are  associated  with  only  one  of  the  substructures,  they 
can  be  eliminated  locally  and  in  parallel.  The  reduced  matrix  is  a  so-called  Schur 
complement  and  has  the  form 

15        -  nBB       ivIB    A//       n IB- 
It  is  now  easy  to  show  that  if  the  corresponding  Schur  complement  of  the  global 
stiffness  matrix  K  is  denoted  by  5,  then, 

(5)  xlSyB  =  YJ*(fs"y{$. 

The  elimination  of  the  interior  variables  from  the  substructures  can  be  viewed  in 
terms  of  orthogonal  projections,  with  respect  to  the  bilinear  form,  of  the  solution  Uh 
of  equation  (1)  onto  the  subspaces  /?q(J2,-)  V\Vh->  *  =  1,  •  •  ■  ,iV".  It  is  easy  to  show  that 
these  subspaces  are  orthogonal,  in  the  sense  of  the  bilinear  form  (4),  to  the  so-called 
piecewise  discrete  harmonic  functions  given  by 

K$xf  +  JffJJ«ff  =  0,    Vt. 

If  the  local  problems  are  solved  exactly,  what  remains  is  to  find  a  sufficiently  ac- 
curate approximation  of  the  part  of  the  solution  which  is  piecewise  discrete  harmonic. 
This  can  be  done  by  approximately  solving  the  reduced  linear  system  with  the  matrix 
S.  A  specific  iterative  substructuring  method  is  obtained  by  selecting  a  precondi- 
tioner  for  the  matrix  S.  Once  an  approximation  of  the  solution  has  been  found  on  the 
boundaries  of  the  substructures,  the  solution  can  be  found  everywhere  by  separately 
solving  local  Dirichlet  problems  on  each  substructure. 

An  important  family  of  domain  decomposition  algorithms  can  be  derived  by  re- 
placing each  contribution  S^  to  the  Schur  complement  by  a  different  matrix  §W.  A 
preconditioner  S  of  S  is  then  created  by  assembling  the  local  contributions  in  the  same 
way  as  in  equation  (5).  When  choosing  5,  it  is  important  to  create  a  linear  system, 
which  is  cheaper  to  solve  and,  at  the  same  time,  to  ensure  that  k(S~1S)  is  small.  A 
natural  idea  is  to  eliminate  the  coupling  between  the  groups  of  variables  associated 
with  different  edges  (faces)  of  the  individual  substructures.  Further  details  are  given 
in  Dryja  and  Widlund  [28],  where  this  process  is  interpreted  as  a  splitting  in  the  sense 
of  Varga  [51].  In  the  analysis,  a  simple  and  powerful  idea  is  based  on  the  fact  that  if 

(7)  e,«j?T$»xg    <  x^V'xg    <  Ci«t°T^°*l?  .  Vxff, 

then, 

(8)  CXgSxB    <  Xg5xB    <  C  x'gSiB  ,  VxB, 


where 

(9)  c   =    min,  c,     and     C    =    max,  C,  . 

When  this  technique  can  be  used,  it  is  as  easy  to  derive  satisfactory  bounds  for 
equations  (3)  as  for  (2);  cf.  Section  5. 

3.  Multiplicative  and  Additive  Schwarz  Methods.  In  this  section,  we  first 
review  Schwarz's  classical  alternating  method  [47],  its  variational  formulation,  cf.  P.  L. 
Lions  [34],  and  an  additive  variant  of  the  algorithm,  cf.  Dryja  [21],  Dryja  and  Widlund 
[26]  and  Matsokin  and  Nepomnyaschikh  [41].  We  then  discuss  a  general  approach  to 
the  analysis  of  the  rate  of  convergence  of  the  additive  algorithms. 

We  begin  by  briefly  discussing  the  classical  formulation  of  Schwarz's  method  in  the 
case  of  the  continuous  Poisson  equation.  There  are  two  fractional  steps  corresponding 
to  two  overlapping  subregions,  Cl[  and  Q'2,  the  union  of  which  is  the  region  fi  .  Let  an 
initial  guess  u°  6  V  be  given.  The  iterate  un+1  is  determined  from  un  by  sequentially 
updating  the  approximate  solution  in  the  two  subregions: 

_Aun+i/2     _     f  inn,^ 

un+l/2       _       un  ond£l[, 


and 


-Aun+1     =     /  inW2, 


u  ^       =     u 


l2 


We  could  just  as  well  have  written  down  the  finite  element  version  of  the  algorithm. 
From  now  on,  we  only  consider  that  case.  It  is  easy  and  convenient  to  describe  this 
method  in  terms  of  two  projections  P,,  i  =  1,2,  onto  V/1  =  HQ(Q'i)f]Vh  ;  cf.  Lions 
[34].  The  projections  are  defined  by 

(10)  a(P,vh,<j>h)  =  a(vh,<f>h),  V<^  g  V-*  . 

It  is  also  easy  to  show  that  the  error  propagation  operator  of  this  multiplicative 
Schwarz  method  is 

(/  -  P2)(I  -  Px). 

This  algorithm  can  therefore  be  viewed  as  a  simple  iterative  method  for  solving 

(ID  (P1  +  P2  -  P2Pl)uh  =  gh, 

with  an  appropriate  right-hand  side  g^  . 

This  algorithm  can  be  generalized  immediately  to  any  number  of  subspaces.  Let 

vh  =  vx  +  v2  + . . .  +  vN  . 

The  subspaces  V,  can  be  chosen  quite  arbitrarily.   The  projections  are  defined  as  in 
equation  (10). 

The  operator  of  (11)  is  a  polynomial  of  degree  two  and,  thus,  the  method  is  not 
ideal  for  parallel  computing,  since  two  sequential  steps  are  involved.  If  more  than 
two  subspaces  are  used,  this  effect  is  further  pronounced  even  if  the  degree  of  the 
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polynomial  representing  the  multiplicative  algorithm  often  is  lower  than  maximal. 
This  is  so  because  a  product  of  two  projections  associated  with  subspaces  that  do 
not  intersect  nontrivially  vanishes;  cf.  further  Widlund  [52],  [55].  This  can  best 
be  described  in  terms  of  the  coloring  of  an  undirected  graph  in  which  the  nodes 
represent  the  subspaces  and  the  edges  nontrivial  intersections  of  pairs  of  subspaces. 
The  problems  associated  with  all  subspaces  of  the  same  color  can  be  solved  in  parallel 
since  those  subspaces  are  mutually  orthogonal.  For  the  purpose  of  theory,  we  can  then 
regard  all  subspaces  of  the  same  color  as  one  subspace;  this  can  greatly  improve  the 
upper  bound  for  the  eigenvalues  of  the  operator  P,  which  we  are  about  to  introduce. 
The  basic  idea  behind  the  additive  form  of  the  algorithm  is  to  work  with  the 
simplest  possible  polynomial  in  the  projections.  The  equation 

(12)  Puh  =  (P1  +  P2  +  ....  +  PN)uh  =  g'h  , 

is  solved  by  an  iterative  method.  (In  fact,  we  could  just  as  well  work  with  J2i=i  °iP) 
where  a,  >  0  are  suitably  chosen  constants.)  Since  we  can  show  that  the  operator  P 
is  symmetric  and  positive  definite,  with  respect  to  the  bilinear  form,  the  method  of 
choice  is  the  conjugate  gradient  method.  Equation  (12)  must  have  the  same  solution  as 
equation  (1),  i.e.  the  correct  right-hand  side  must  be  found.  Since  a{uh,4>h)  =  f(4>h)  ? 
by  equation  (1),  the  right-hand  side  g'h  can  be  constructed  by  solving  equation  (10)  for 
all  values  of  i  and  adding  the  results.  It  is  similarly  possible  to  apply  the  operator  P  of 
equation  (12)  to  any  element  of  Vh  by  applying  each  projection  P,  to  the  element  and 
adding  the  results.  Most  of  the  work,  in  particular  that  which  involves  the  individual 
projections,  can  be  carried  out  in  parallel.  We  note  that  an  additive  Schwarz  algorithm 
is  fully  specified  by  its  subspaces.  In  this  survey  paper,  we  will  sometimes  just  specify 
the  subspaces  without  much  discussion  of  how  the  corresponding  algorithm  can  be 
implemented. 

It  is  well  known  that  the  number  of  steps  required  to  decrease  an  appropriate 
norm  of  the  error  of  a  conjugate  gradient  iteration  by  a  fixed  factor  is  proportional  to 
•v/k  ,  where  k  is  the  condition  number  of  the  relevant  operator;  see  e.g.  Golub  and  Van 
Loan  [29]  .  We  therefore  need  to  establish  that  the  operator  P  of  equation  (12)  is  not 
only  invertible  but  also  that  satisfactory  upper  and  lower  bounds  on  its  eigenvalues 
can  be  obtained. 

An  upper  bound  for  the  eigenvalues  of  P  is  given  by  TV  since  P  is  the  sum  of 
projections.  By  combining  all  subspaces  of  the  same  color  into  one,  as  indicated 
above,  this  bound  can  be  improved  by  replacing  N  by  the  number  of  colors  of  any 
coloring  of  that  graph.  For  several  algorithms  this  results  in  an  upper  bound,  which 
is  independent  of  the  number  of  subspaces.  In  other  instances,  a  useful  technique 
is  provided  by  strengthened  Cauchy  inequalities;  cf.  Mandel  and  McCormick  [38], 
Widlund  [56]  and  Yserentant  [58]. 

A  lower  bound  can  often  be  obtained  conveniently  by  using  a  lemma,  inspired  by 
Lions  [34];  the  simple  proof  is  also  given  in  Widlund  [56]. 

LEMMA  1.  Let  u^  =  J2t=i  uh,U  where  Uh,i  G  V{,  be  a  representation  of  an  element 
ofVh  =  V\  +  ...  +  Vn.  If  the  representation  can  be  chosen  so  that  YliL\  a(u/i,Mu/»,i)  < 
C$a(uk,uh),Vuh  G  Vh,  then  A^P)  >  Cq2 . 

Remark.  If  we  expand  individual  subspaces,  there  is  a  larger  choice  in  selecting 
Ufctt  G  Vi.  The  best  bound  for  Co  can  then  only  improve.  If  we  can  expand  the 
subspaces  without  worsening  the  upper  bound,  which  is  often  possible,  our  estimate 
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of  k(P)  improves.  On  the  other  hand,  an  enlarged  subspace  also  means  that  the 
subproblem  has  more  variables  and  that  it  is  worse  conditioned.  For  the  special  case 
of  the  classical  Schwarz  method  on  two  regions,  this  tradeoff  is  well  understood;  for  a 
discussion  of  precise  estimates  of  the  rate  of  convergence;  see  Bjorstad  and  Widlund 
[10]. 

Remark.  In  his  recent  dissertation,  Tarek  Mathew  [40]  has  given  a  bound  on  the 
rate  of  convergence  of  the  multiplicative  algorithm  in  terms  of  C0  and  the  number 
of  colors  of  the  graph.  While  his  bound  can  probably  be  improved  considerably,  it 
nevertheless  shows  that  if  these  parameters  are  bounded  independently  of  the  mesh 
parameters,  then  the  spectral  radius  of  the  error  propagation  operator  is  uniformly 
bounded  by  a  constant  that  is  less  than  1. 

One  of  the  attractive  features  of  the  framework  introduced  in  this  section  is  the 
ease  by  which  variants  of  the  basic  algorithm,  obtained  by  replacing  the  local  linear 
systems  by  preconditioners,  can  be  analyzed.  Let  us,  for  example,  consider  the  additive 
Schwarz  method  for  a  finite  element  approximation  of  the  problem  discussed  in  the 
beginning  of  this  section.  We  can  write  the  projection  Pa  in  matrix  terms.  After  a 
suitable  permutation  of  the  variables,  it  corresponds  to 

(13)  y  =  Piz=  (KW    °q)kx. 

Here  A'(1)  represents  the  stiffness  matrix  of  the  Dirichlet  problem  on  ftj.  It  is  easy  to 
see  that  the  matrix  of  (13)  is  symmetric  in  the  A'-inner  product  that  corresponds  to 
the  bilinear  form.  If  A'(~]  is  replaced  by  A'~j,  etc.,  then  it  is  easy  to  show  by  using  a 
Rayleigh  quotients  argument  that  the  eigenvalues  of  the  resulting  operator  P  satisfy 

(14)  ^n(P)m}n(Xmn(K(,)K^))   <    A^P), 
and 

(15)  A^CP)   <    Xm&x(P)m3.x(Xm&x(K(i)K-^))  . 

An  estimate  of  k(P)  follows  immediately, 

us)  MP)<max,wg(0*»\n 

mm,  Amin(A(,)A'(i)1) 

4.  Additive  Schwarz  Methods  on  Overlapping  Subregions.  We  now  de- 
scribe the  additive  Schwarz  method  introduced  in  Dryja  and  Widlund  [26],  [28];  cf. 
also  Dryja  [21].  In  those  papers  only  Poisson's  equation  was  considered.  In  the  origi- 
nal technical  report,  good  results  were  given  only  for  problems  in  two  dimensions,  but 
we  now  need  no  such  restrictions.  The  most  satisfactory  proof,  so  far,  is  given  in  [28]. 
Here,  we  also  discuss  problems  of  the  form  (3). 

We  start  with  the  same  triangular  (simplicial)  non-overlapping  substructures  ft, 
that  have  been  considered  before.  We  extend  each  substructure  to  a  larger  region 
ft'We  assume  that  the  overlap  is  generous,  assuming  that  the  distance  between  the 
boundaries  dft,  and  dft(  is  bounded  from  below  by  a  fixed  fraction  of  #,,  the  diameter 
of  ft, .  We  also  assume  that  dil[  does  not  cut  through  any  element.  We  make  the  same 
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construction  for  the  substructures  that  are  next  to  the  boundary  except  that  we  cut 
off  the  part  of  &.[  that  is  outside  of  Q  . 

Remark.  The  analysis  of  Schwarz  methods  is  more  complicated  when  the  overlap 
is  less  generous;  cf.  a  discussion  in  Lions  [34].  Such  a  situation  occurs  if  the  region 
is  L-shaped  and  partitioned  into  two  overlapping  rectangles.  A  similar  situation  is 
discussed  in  Dryja  and  Widlund  [28]  in  an  analysis  of  iterative  substructuring  methods; 
cf.  also  Section  5. 

Our  finite  element  space  is  represented  as  the  sum  of  N+l  subspaces 

vh  =  v0h  +  v1h  +  ...  +  v^. 

The  first  subspace  Vq  is  equal  to  VH ,  the  space  of  continuous,  piecewise  linear  func- 
tions on  the  coarse  mesh  defined  by  the  substructures  $7,  .  The  other  subspaces  are 
related  to  the  subdomains  in  the  same  manner  as  in  a  traditional  Schwarz  algorithm, 

i.e.  v?  =  #$(«{)  n^*. 

Remark.  There  are  few  differences  between  the  problem  related  to  the  first  sub- 
space  and  the  others.  The  right  hand  side  of  the  particular  linear  system  is  generated 
as  weighted  averages  with  weights  determined  by  the  coarse  mesh  basis  functions.  We 
also  need  to  interpolate  the  solutions  in  Vq  when  combining  them  with  the  contri- 
butions from  the  other  subspaces.  The  coarse,  global  approximation  of  the  elliptic 
equation  is  otherwise  quite  similar  to  the  local  problems. 

The  global  coarse  problem  provides  a  mechanism  for  the  global  transportation 
of  information.  As  shown  in  Widlund  [54],  the  rate  of  convergence  of  any  domain 
decomposition  method,  for  which  the  interaction  is  only  through  next  neighboring 
subdomains,  has  a  condition  number  which  grows  at  least  as  fast  as  l/H2. 

The  following  result  is  established  in  Dryja  and  Widlund  [28]. 

Theorem  1.  The  operator  P  of  the  additive  algorithm  defined  by  the  spaces  VH 
and  Vj,  applied  to  equation  (2),  satisfies  the  estimate  k(P)  <  const. 

In  the  proof,  a  quasi-interpolant,  Ih  ,  which  is  a  bounded  operator  in  Hq(CI)  and 
satifies 

\\uh  -  lHUh\\L2(Q)  <  const. H\uh\Hi{Q), 

is  used.  By  examining  the  proof,  it  is  easy  to  see  that  if  uniform  bounds  of  the  same 
form  hold  for  the  weighted  spaces  L2,p(£l)  and  Hq  -(ft),  defined  by 


Lp(x)\u\2dx      and       J    p(x)\Vu\2dx, 


respectively,  then  the  optimality  also  holds  for  the  variable  coefficient  case  (3).  This  is 
a  difficult  problem,  still  not  fully  understood;  cf.  Xu  [57]  for  partial  results.  We  note 
that  for  two-dimensional  problems  the  proof  given  in  Dryja  and  Widlund  [26]  works 
just  as  well  for  equation  (3).  The  bound  of  the  condition  number  given  in  that  paper 
is  proportional  to  (1  +  log{H/h)). 

We  have  also  considered  the  use  of  more  than  two  levels  of  partitioning  of  the  re- 
gion and  established  that  for  equation  (2)  k(P)  does  not  grow  faster  than  quadratically 
in  the  number  of  levels.  We  do  not  know  if  this  result  can  be  improved.  For  convex 
regions,  we  have  shown  that  the  growth  is  linear  in  the  number  of  levels.  This  result 
cannot  be  improved.  Our  work  has  been  carried  out  jointly  with  Xuejun  Zhang,  who 
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has  also  made  a  large  number  of  numerical  experiments.  Rapid  convergence  can  be 
obtained  when  using  such  methods  solving  only  very  small  linear  systems  of  equations. 
As  we  have  already  pointed  out  in  the  introduction,  Xiao-Chuan  Cai  [20],  [19]  and 
Tarek  Matew  [40]  have  extended  our  analysis  to  certain  nonsymmetric  and  indefinite 
problems. 

5.  Iterative  Substructuring  Methods  as  Schwarz  Methods.  In  this  sec- 
tion, we  show  how  iterative  substructuring  methods  can  be  derived  by  using  the 
framework  of  subspaces  and  projections  developed  in  Section  3.  We  first  consider 
problems  in  the  plane  and  give  an  estimate  of  the  condition  number  of  the  operator 
P  that  corresponds  to  a  specific  choice  of  subspaces.  For  this  basic  algorithm,  the 
proof  of  the  estimate  does  not  rely  on  the  extension  theorems,  which  have  played  an 
important  role  in  previous  estimates  for  iterative  substructuring  methods;  cf.  Wid- 
lund  [53].  We  then  indicate  how  results  for  the  special  case  of  two  substructures,  and 
an  argument  about  preconditioned  for  the  subspace  problems,  can  be  used  to  derive 
different  algorithms,  including  one  due  to  Bramble,  Pasciak  and  Schatz  [15].  We  also 
consider  a  second  choice  of  subspaces  and  show  that  the  resulting  algorithm  extends 
to  a  fast  method  for  the  three-dimensional  case.  The  resulting  algorithm  is  similar  but 
not  identical  to  one  of  the  two  algorithms  considered  in  Bramble,  Pasciak  and  Schatz 
[17];  cf.  also  Dryja  [24]  for  early  work  on  the  three-dimensional  case.  The  discussion 
in  this  section  extends  that  in  Dryja  and  Widlund  [28]  and  relies  in  part  on  ideas  from 
Mandel  [36]. 

We  assume  that  the  region  is  divided  into  substructures  and  elements  as  in  Section 
2.  We  first  consider  the  two-dimensional  case  and  introduce  an  additive  Schwarz 
method.  We  use  the  coarse  space  VH  introduced  in  Section  4  and  let  the  subregions 
Qij  =  ^i  U  r,j  (J  fij  P^y  the  same  role  as  Q.'{  in  Section  4.  Here  fit  and  Qj  are  adjacent 
substructures  with  a  common  edge  Tij.  The  local  subspaces  V&  are  thus  defined  as 

Compared  with  the  subspaces  used  in  the  previous  section,  we  use  less  overlap 
in  the  sense  that  only  the  elements  of  VH  can  differ  from  zero  at  the  vertices  of  the 
substructures.  This  is  reflected  in  a  poorer  bound  on  the  condition  number  when  we 
work  with  Lemma  1. 

Theorem  2.  In  the  two-dimensional  case,  the  operator  P  of  the  additive  algorithm 
defined  by  the  spaces  VH  and  Vfi  satisfies  the  estimate  k(P)  <  const.(l  +  log(H/h))2 
for  equations  (2)  and  (3).  The  constant  is  independent  of  h,  H  and  the  discontinuities 
of  the  coefficient  p(x). 

For  this  algorithm,  Amax(P)  <  4  ;  cf.  Dryja  and  Widlund  [28].  It  is  easy  to  see 
that  no  more  than  four  colors  are  needed  to  color  the  graph;  each  interior  substructure 
is  covered  exactly  three  times  by  the  subspaces  associated  with  its  three  edges  and, 
in  addition,  we  have  the  global  space  V"  which  intersects  all  other  subspaces.  In  our 
proof  of  the  lower  bound  of  the  spectrum  of  P,  given  fully  in  [28],  we  use  Lemma  1 
and  the  following  lemma,  which  also  plays  an  important  role  in  the  more  traditional 
theory  of  iterative  substructuring  algorithms. 

Lemma  2.   Let  a  be  any  convex  combination  of  values  o/u/,(x),x  6  fit'-  Then 

Hu*  _  QUl~(ft,)  £  const.(l  +  \og(H/h))\uh\2HHrit)  ■ 
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